Retrieval Augmented Generation

Source

  1. Stack Overflow article
  2. RAG Paper

Basic Overview

tags: rag, llm, retrieval augmented generation, vectordb, langchain, openai, chatgpt

Issues with using just LLMs:

  1. Out of date
  2. Hallucination if they don't know the answer

```python
# Quick end-to-end example: load a web page, index it, and ask a question over it
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator

loader = WebBaseLoader("https://www.promptingguide.ai/techniques/rag")
index = VectorstoreIndexCreator().from_loaders([loader])
index.query("What is RAG?")
```

Basic Architecture

(diagram: basic RAG architecture)

Orchestration Layer

It receives the user input, interacts with all the tooling, sends the prompt to the LLM, and returns the results.

Tools: LangChain, Semantic Kernel, plain Python code

Retrieval Tools

Ground the LLM's response to the user prompt. These include knowledge bases and API-based retrieval systems.

LLMs

The model the prompts are sent to.

Closed models (consumed via an API): Claude, OpenAI GPT-4

Open-source models: Llama 2, Mistral, Phi, Gemma

Tools for open source models:

  1. Ollama
  2. Huggingface

Knowledge Based Retrieval

A vector store (not a strict requirement for RAG) stores embedding vectors and supports retrieval by vector similarity rather than exact match.
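
For intuition, here is a toy sketch of similarity-based retrieval with plain numpy and cosine similarity (the documents and vectors are made up):

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity: dot product of the two vectors divided by their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": the query never matches any stored text exactly;
# retrieval works on vector similarity instead of exact matching.
store = {
    "a note about cats": np.array([0.9, 0.1]),
    "a note about cars": np.array([0.1, 0.9]),
}
query = np.array([0.8, 0.2])
print(max(store, key=lambda doc: cosine_sim(store[doc], query)))  # -> "a note about cats"
```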

Raw Data --> VectorStore (ETL Pipeline)

ETL Pipeline

  1. Aggregate source documents.
  2. Clean document content - anything such as PHI or PII that needs to be removed should be handled here.
  3. Load documents using tools such as Unstructured, LlamaIndex, or LangChain document loaders - loaders may have caveats, so read about them before using them.
  4. Split: split the content into chunks that fit in the prompt without losing meaning. If nothing works, you might need to write your own text splitter.
  5. Create embeddings for the text chunks - a numerical representation of a chunk's relative position and relationship to other chunks. Tools include OpenAI embeddings; LlamaIndex and LangChain also ship wrappers, and libraries like sentence-transformers can be used.
  6. Store: add the embeddings to a vector store such as Pinecone, Weaviate, FAISS, or Chroma, or just store the vectors on the filesystem (a rough sketch of these steps follows below).

(diagram: RAG ETL pipeline)
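
A rough sketch of steps 3-6 using the same (older) LangChain import style as the snippet above; the module paths and chunk sizes are assumptions and differ in newer LangChain versions:

```python
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 3. Load source documents (aggregation/cleaning would happen before this)
docs = WebBaseLoader("https://www.promptingguide.ai/techniques/rag").load()

# 4. Split into chunks that fit the prompt without losing meaning
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# 5-6. Embed the chunks and store them in a vector store (FAISS here)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())
print(vectorstore.similarity_search("What is RAG?", k=3))
```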

Vector stores allow updating if we need to add or remove source documents; fine-tuning, on the other hand, doesn't allow removing content.

Processing something like patient notes might need an efficient way to index documents.


API Based Retrieval

The orchestration layer can also add context via a programmatic API endpoint.

Prompting with [[RAG]]

The prompt might need -

  • Assistant message: instructions for the chatbot
  • History: possibly the conversation history
  • Context: relevant chunks from the retrieval tools; label different types of data sources if you have them
  • User prompt: the question that the user wants to ask

The prompts might need cleaning to remove PII etc.

Use tiktoken to make sure you are not exceeding API limits.
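
For example, a minimal sketch of assembling the pieces above into one prompt and checking its size with tiktoken (the chunks, history, and question are placeholders):

```python
import tiktoken

system_message = "You are a helpful assistant. Answer only from the provided context."
context_chunks = ["[knowledge base] chunk 1 ...", "[api] chunk 2 ..."]  # from the retrieval tools
history = "User: ...\nAssistant: ..."
user_question = "What is RAG?"

prompt = "\n\n".join([
    system_message,
    "Conversation history:\n" + history,
    "Context:\n" + "\n".join(context_chunks),
    "Question: " + user_question,
])

enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode(prompt)))  # keep this under the model's context limit
```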

Improving Performance

  • Garbage in, garbage out - make sure the data is correctly captured, e.g. that Excel headers are preserved.
  • Tune the split strategy - try vector stores built with different chunking.
  • Tune the system prompt.
  • Filter vector store results.
  • Try different embedding models or fine-tune your own model.

Tool selection

Vector Databases
  1. Filesystem
  2. OpenAI Assistants API (built-in retrieval)
  3. Milvus
  4. Chroma
  5. Pinecone
  6. Postgres (pgvector)
  7. MongoDB

Metrics

It can be non-trivial to measure the performance of a RAG pipeline as it consists of multiple steps.

  1. RAGAS
  2. TruLens

Learnings

First deployment:
  1. Embedding model: text-embedding-3-large, 3072 dimensions (Azure)
  2. LLM: GPT-4 (Azure)
  3. Vector DB: Milvus

I had to set MilvusVectorStore(dim=3072, overwrite=True) to match the embedding model. (Is it a coincidence that this is the same inner dimension as in the original GPT paper?)

From the GPT paper: "Model specifications: Our model largely follows the original transformer work [62]. We trained a 12-layer decoder-only transformer with masked self-attention heads (768-dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072-dimensional inner states."
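
A hedged sketch of what that Milvus wiring looks like with LlamaIndex (assuming a llama-index version that ships the Milvus integration; module paths vary across versions):

```python
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.milvus import MilvusVectorStore

# dim must match the embedding model: text-embedding-3-large produces 3072-dim vectors
vector_store = MilvusVectorStore(dim=3072, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = [Document(text="...patient note or other source text...")]
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
```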

Todos

  1. How to retrieve relevant sections of text?

CS 25 Lecture

LLM issues:

  1. Hallucination
  2. Attribution
  3. Staleness
    1. old data
  4. Revisions
    1. removing information, e.g. a patient withdrew consent
  5. Customization

Solution:

  1. Couple to external memory

Contextualization

Paradigms

  1. RAG is an open-book setting.
  2. Parametric vs. semi-parametric

Architectures

Train: update the LM? The query encoder? The document encoder? Both? All of them? Pretrain from scratch?

Test: it is a system rather than a model - multiple models, with different or the same index?

Frozen RAG

Static - works because of in-context learning.

Sparse Retrieval

TF-IDF, [[BM25]] (Best Match 25)
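
A tiny sparse-retrieval sketch using the rank_bm25 package (one common BM25 implementation; the corpus is a toy placeholder):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "RAG couples a language model to an external memory",
    "BM25 is a sparse, term-frequency based retrieval function",
    "Dense retrievers embed queries and passages into vectors",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "what is bm25 retrieval".split()
print(bm25.get_scores(query))              # per-document relevance scores
print(bm25.get_top_n(query, corpus, n=1))  # best-matching document
```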

Dense Retriever

Not about matching specific words.

  • ORQA (Lee et al. 2019)
  • Dense Passage Retriever (DPR) (Karpukhin, Oguz et al. 2020)

MIPS: Maximum Inner Product Search

FAISS (Johnson et al. 2019): ANN algorithms

  • faster search using centroids of vectors

ColBERT (Khattab et al. 2020): named after [[Stephen Colbert]]'s shows

  • late interaction
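
A minimal FAISS sketch of exact maximum inner product search over normalized vectors (random vectors stand in for real embeddings; the dimension is an assumption):

```python
import numpy as np
import faiss

d = 384  # assumed embedding dimension (e.g. a small sentence-transformer)
doc_vectors = np.random.rand(1000, d).astype("float32")
faiss.normalize_L2(doc_vectors)  # with unit vectors, inner product == cosine similarity

index = faiss.IndexFlatIP(d)     # exact (flat) maximum inner product search
index.add(doc_vectors)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 most similar documents
print(ids, scores)
```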

SOTA

  • SPLADE: sparse meets dense (Formal et al. 2021)
  • DRAGON (Lin et al. 2023) - use this as the current off-the-shelf choice
  • Hybrid search - might be the way to go

Nice papers:

  • RePlug (Shi et al. 2023) - uses perplexity and KL divergence
  • In-Context RALM (Ram et al. 2023) - also works well

Contextualization of both retriever and generator

  1. RAG (Lewis et al. 2020) - generator and retriever are both updated. The paper says that keeping them frozen doesn't work very well.
  2. FiD (Izacard and Grave 2020) - better for a larger number of documents.
  3. kNN-LM (Khandelwal et al. 2019) - a very simple idea that scales really well if you have a huge retrieval corpus (see the sketch after this list).
  4. Retro (Borgeaud et al. 2022) - pretrain from scratch.
  5. Retro++ (Wang, Ping, Xu et al. 2023) - in-context RAG with Retro.
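
The kNN-LM idea in one line: the final next-token distribution interpolates the parametric LM distribution with a non-parametric distribution built from retrieved neighbours. A toy sketch (the distributions and λ are made up):

```python
import numpy as np

def knn_lm_interpolate(p_lm, p_knn, lam=0.25):
    # kNN-LM: p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x)
    return lam * p_knn + (1 - lam) * p_lm

p_lm = np.array([0.7, 0.2, 0.1])   # LM's next-token distribution
p_knn = np.array([0.1, 0.8, 0.1])  # distribution from retrieved neighbours
print(knn_lm_interpolate(p_lm, p_knn))
```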

Why is there no open-source Retro? It might not have been working very well.

Contextualization All the way

REALM (Guu et al 2020)

The OG of non-frozen dense retrieval-augmented LMs. Really visionary work. Downside: not really generative - BERT all the way.

Atlas: Deep Dive

How to train the retriever?

  • FiD-style "attention distillation"
  • EMDR

How do we pretrain?

  • Prefix LM
  • MLM
  • Title-to-section generation

How do we update the retriever?

  • Query-side only does work well
  • Reranker
  • Full update of both sides

Mistral works so much better because it is trained on a lot of data and, with sliding-window attention, it doesn't need to attend to everything.

RAG vs Long Context

Attending over the whole long context for a specific question is inefficient, and the underlying architecture uses sparse attention, which makes it similar to RAG.

When to retrieve?

  • FLARE "active retrieval augmentation"
  • RAG-token, RAG-sequence, Retro-chunk

Sometimes I want to retrieve and sometimes I just want to generate - the FLARE paper does this by training the LLM to know when to retrieve and when not to.

  • TRIME (Zhong et al 2022) - how big is the index?
  • SILO (Min, Gururangan et al 2023) - isolating legal risk with a non-parametric data store
  • Lost in the Middle (Liu et al 2023)
    • Retrieved contexts in the middle get "lost"
  • WebGPT (Nakano et al. 2021)
  • Toolformer (Schick et al. 2023)
  • Self-RAG (Asai et al 2023)

Instruction Tuning

  • InstructRetro (Wang et al. 2023) / RA-DIT (Lin, Chen, Chen et al. 2023)

Advanced Frozen RAG

  • Child-parent recursive retriever
  • Hybrid search
  • Using a zero-shot LLM
  • HyDE: Hypothetical Document Embeddings (Gao, Ma, et al.) - see the sketch below
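
A rough sketch of the HyDE idea: embed a hypothetical LLM-written answer instead of the raw question, and retrieve with that embedding. `generate_hypothetical_answer` is a placeholder for whatever LLM call you use, and the embedding model is an assumption:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def hyde_query_embedding(question, generate_hypothetical_answer):
    # 1. Ask an LLM to write a plausible (possibly wrong) answer document.
    hypothetical_doc = generate_hypothetical_answer(question)
    # 2. Embed the hypothetical document, not the question itself,
    #    and use this vector to query the vector store.
    return embedder.encode(hypothetical_doc)
```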

Future

  • Joint from-scratch pretraining is still unexplored
  • What does scaling look like?
  • Rerankers work well and work better than vector databases -- so we might not need vector databases
  • How do we measure the performance of RAG? Currently we measure downstream performance
  • Multimodal RAG (Gur et al. 2021, Yasunaga et al. 2023)

RAG 2.0

  • System over model
  • Optimize it all the way - why not backprop into the chunker?
  • Trading off cost and quality
  • Zero-shot domain generalization