RAG System Overview

Retrieval Augmented Generation

LLMs such as the GPT family, Llama, Qwen, or DeepSeek are trained on large datasets gathered mainly by crawling the internet. This step of training on large-scale data is called pretraining, and it is very costly: for example, Llama 3 with 1 billion parameters required about 314k GPU hours on H100 80GB hardware. The knowledge of these LLMs is constrained by the websites crawled (i.e., the data they were trained on) and by the time of crawling. This means that when asked about restricted information, for example a company's internal documents, or about events that occurred after the crawl, they fail to answer correctly, i.e., they hallucinate. Addressing this challenge is crucial for making LLMs more valuable, especially within companies.

We could think of updating LLMs with new information by fine-tuning them, either with full fine-tuning or with parameter-efficient techniques such as LoRA. But fine-tuning also has a cost. Imagine an investor who needs a daily summary of news on different topics, e.g., politics, finance, regulation, and technology, across different countries. In such a scenario, where the data is updated regularly, fine-tuning is not practical. To overcome this challenge, Retrieval Augmented Generation (RAG) tends to be a suitable solution. RAG takes advantage of the ability of LLMs to understand the information provided within the prompt, called the "context", and combines it with retrieval techniques that fetch information relevant to the query from an external database. A minimal sketch of this retrieve-then-augment idea follows.
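To make this concrete, here is a minimal sketch of the two steps, retrieval and prompt augmentation, using TF-IDF similarity over a toy in-memory document store. The document texts, the `retrieve` and `build_prompt` helpers, and the prompt wording are illustrative assumptions, not a fixed RAG API; a production system would typically use learned embeddings, a vector database, and an actual LLM call for the final generation step.

```python
# Minimal retrieve-then-augment sketch (toy example, not production RAG).
# Assumes scikit-learn is installed: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# External "database": in a real system these would be chunks of
# company documents or fresh news articles, stored in a vector store.
documents = [
    "Q3 revenue grew 12% driven by the cloud division.",
    "The new data-privacy regulation takes effect in January.",
    "The central bank kept interest rates unchanged this quarter.",
]

# Retrieval step: rank documents by similarity to the query.
# TF-IDF stands in for the learned embeddings a real RAG system would use.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_indices]

# Augmentation step: place the retrieved passages in the prompt as context.
# In a real system this prompt would be sent to an LLM; the call is omitted here.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

print(build_prompt("What happened to interest rates?"))
```

The key property is that updating the system's knowledge only means updating the document store; neither the retriever nor the LLM needs retraining, which is exactly what makes RAG attractive for regularly refreshed data such as the daily news in the investor example above.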

RAG Framework

Basic RAG Workflow