LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

University of Waterloo

Abstract

In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR typically work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the "needle" unit, while the reader only needs to extract answers from these short retrieved units. This imbalanced design, with a heavy retriever and a light reader, can lead to sub-optimal performance. To alleviate the imbalance, we propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader". LongRAG processes the entire Wikipedia into 4K-token units, which are 30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 600K. This significantly lowers the burden on the retriever and leads to remarkable retrieval scores: answer recall@1 = 71% on NQ (previously 52%) and answer recall@2 = 72% on HotpotQA (full-wiki) (previously 47%). We then feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), which is on par with the SoTA models. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.


Figure 1: Traditional RAG vs. LongRAG.

Traditional RAG operates on short retrieval units, where the retriever needs to scan over a massive number of units to find the relevant piece. In contrast, LongRAG operates on long retrieval units (30x longer). The retriever has a much smaller workload, which significantly boosts the recall score. LongRAG fully exploits the ability of long-context language models (the reader) to achieve strong performance.

Framework


Figure 2: Our proposed LongRAG framework comprises two components: the Long Retriever and the Long Reader.

On the left, long retrieval units are formed by grouping Wikipedia documents that are related through hyperlinks. Each retrieval unit contains an average of 4K tokens, corresponding to multiple related documents. On the right, a multi-hop question-answering test case from HotpotQA is shown. The final answer can be obtained using only a few retrieval units, which are then fed into the long reader.
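To make the grouping step concrete, below is a minimal Python sketch of how related documents could be merged into roughly 4K-token retrieval units by following hyperlinks. It is illustrative only: the data schema (doc_id, text, linked_ids), the greedy merging order, and the whitespace-based token counting are assumptions and may differ from the released implementation.

# Illustrative sketch (not the released implementation) of grouping related
# Wikipedia documents into long retrieval units via hyperlinks.
# The fields doc_id, text, and linked_ids are hypothetical.
from typing import Dict, List, Set


def group_documents(docs: Dict[str, dict], max_tokens: int = 4096) -> List[List[str]]:
    """Greedily merge each seed document with its hyperlinked neighbors
    until the group approaches the token budget (~4K tokens)."""
    grouped: List[List[str]] = []
    assigned: Set[str] = set()

    for doc_id, doc in docs.items():
        if doc_id in assigned:
            continue
        group = [doc_id]
        budget = len(doc["text"].split())  # crude whitespace token count
        assigned.add(doc_id)

        # Follow outgoing hyperlinks and absorb related documents
        # while the combined length stays under the budget.
        for linked_id in doc.get("linked_ids", []):
            if linked_id in assigned or linked_id not in docs:
                continue
            linked_len = len(docs[linked_id]["text"].split())
            if budget + linked_len > max_tokens:
                continue
            group.append(linked_id)
            assigned.add(linked_id)
            budget += linked_len

        grouped.append(group)
    return grouped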

Retrieval Results

Retrieval performance on NQ:

Retrieval Unit    | Corpus Size | Num of Retrieval Units | Avg. Num of Tokens (Corpus) | Avg. Num of Tokens (Test Set) | Answer Recall (AR)
Passage           | 22M         | 1                      | 120                         | 130                           | 52.24
Passage           | 22M         | 100                    | 12K                         | 14K                           | 89.92
Passage           | 22M         | 200                    | 24K                         | 28K                           | 91.30
Document          | 3M          | 1                      | 820                         | 4K                            | 69.45
Document          | 3M          | 5                      | 4K                          | 18K                           | 85.37
Document          | 3M          | 10                     | 8K                          | 34K                           | 88.12
Grouped Documents | 600K        | 1                      | 4K                          | 6K                            | 71.69
Grouped Documents | 600K        | 4                      | 16K                         | 25K                           | 86.30
Grouped Documents | 600K        | 8                      | 32K                         | 50K                           | 88.53

Employing a long-context retriever (with up to 6K tokens per retrieval unit on average) compresses the corpus size by up to 30 times (from 22M to 600K units) and improves top-1 answer recall by approximately 20 points (from 52.24 to 71.69). Furthermore, long-context retrieval requires significantly fewer retrieval units (about 10 times fewer) to achieve comparable results. Integrating long-context retrieval therefore significantly alleviates the burden on the retriever model.
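Answer Recall (AR) in these tables measures whether a gold answer string appears somewhere in the retrieved context. Below is a minimal sketch of AR@k under a simple string-containment definition; the exact normalization and matching rules used in the paper's evaluation may differ.

# Minimal sketch of Answer Recall (AR) @ k: the fraction of questions for
# which at least one gold answer string appears in the concatenation of the
# top-k retrieved units. Normalization here is simplified (lowercasing only).
from typing import List


def answer_recall_at_k(retrieved_units: List[List[str]],
                       gold_answers: List[List[str]],
                       k: int) -> float:
    hits = 0
    for units, answers in zip(retrieved_units, gold_answers):
        context = " ".join(units[:k]).lower()
        if any(ans.lower() in context for ans in answers):
            hits += 1
    return hits / len(gold_answers)


# Example: AR@1 over two questions.
# answer_recall_at_k([["... Paris is the capital ..."], ["... unrelated ..."]],
#                    [["Paris"], ["Berlin"]], k=1)  # -> 0.5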


Retrieval performance on HotpotQA:

Retrieval Unit    | Corpus Size | Num of Retrieval Units | Avg. Num of Tokens (Corpus) | Avg. Num of Tokens (Test Set) | Recall (R) | Answer Recall (AR)
Document          | 5.2M        | 2                      | 130                         | 200                           | 30.01      | 47.75
Document          | 5.2M        | 100                    | 6.5K                        | 10K                           | 74.84      | 84.67
Document          | 5.2M        | 200                    | 13K                         | 20K                           | 79.68      | 88.34
Grouped Documents | 500K        | 2                      | 1K                          | 8K                            | 56.30      | 72.49
Grouped Documents | 500K        | 8                      | 4K                          | 29K                           | 74.71      | 84.40

Similar to the findings on NQ, long-context retrieval significantly alleviates the burden on the retriever component within the RAG framework on HotpotQA.

QA Results



Category             | Method                                   | EM (Exact Match)
Closed-Book          | GPT-4-Turbo                              | 41.2
Closed-Book          | Gemini-1.5-Pro                           | 47.8
Closed-Book          | Claude-3-Opus                            | 49.2
Fully-supervised RAG | REALM                                    | 40.4
Fully-supervised RAG | DPR                                      | 41.5
Fully-supervised RAG | RAG                                      | 44.5
Fully-supervised RAG | RETRO                                    | 45.5
Fully-supervised RAG | RePAQ                                    | 47.8
Fully-supervised RAG | FiD                                      | 51.4
Fully-supervised RAG | EMDR                                     | 52.5
Fully-supervised RAG | Atlas                                    | 64.0
No Fine-tuning RAG   | REPLUG                                   | 45.5
No Fine-tuning RAG   | LongRAG (Gemini-1.5-Pro; Recall 4 units) | 58.6
No Fine-tuning RAG   | LongRAG (GPT-4o; Recall 4 units)         | 62.7

The table shows the QA results on the NQ dataset. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where a RAG framework is used and the model is fully supervised and trained on the training data; and no fine-tuning RAG, which employs the RAG framework without any tuning.
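As a concrete illustration of the no-fine-tuning reader step, the sketch below concatenates the top-k retrieved units (≈30K tokens) into a single prompt and asks a long-context LLM to extract a short answer zero-shot. The prompt wording and the use of the OpenAI chat API with gpt-4o are placeholders; the exact prompts and extraction pipeline used in the paper may differ.

# Hedged sketch of the long-reader step: feed the concatenated top-k retrieved
# units to a long-context LLM and extract a short answer zero-shot.
# The prompt text below is a placeholder, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def long_reader(question: str, retrieved_units: list[str], model: str = "gpt-4o") -> str:
    context = "\n\n".join(retrieved_units)  # roughly 30K tokens for the top-4 NQ units
    prompt = (
        "Answer the question based on the given passages.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer with a short span only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()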


Category             | Method                                   | EM (Exact Match)
Closed-Book          | GPT-4-Turbo                              | 42.4
Closed-Book          | Gemini-1.5-Pro                           | 33.9
Closed-Book          | Claude-3-Opus                            | 32.8
Fully-supervised RAG | CogQA                                    | 37.1
Fully-supervised RAG | DrKIT                                    | 42.1
Fully-supervised RAG | Transformer-XH                           | 51.6
Fully-supervised RAG | QAMAT+                                   | 57.6
Fully-supervised RAG | HGN                                      | 59.7
Fully-supervised RAG | PathRetriever                            | 60.0
Fully-supervised RAG | HopRetrieve                              | 62.1
Fully-supervised RAG | MDR                                      | 62.3
Fully-supervised RAG | HopRetrieve-plus                         | 66.5
Fully-supervised RAG | AISO                                     | 68.1
Fully-supervised RAG | COS                                      | 68.2
No Fine-tuning RAG   | DSP                                      | 51.4
No Fine-tuning RAG   | PromptRank                               | 55.7
No Fine-tuning RAG   | LongRAG (Gemini-1.5-Pro; Recall 8 units) | 57.5
No Fine-tuning RAG   | LongRAG (GPT-4o; Recall 8 units)         | 64.3

The table shows the QA results on the HotpotQA dev set. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where a RAG framework is used and the model is fully supervised and trained on the training data; and no fine-tuning RAG, which employs the RAG framework without any tuning.
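Both QA tables report exact match (EM). For reference, the sketch below implements the standard SQuAD-style EM computation (lowercasing, stripping punctuation and articles before comparing); the evaluation script behind these numbers may apply slightly different normalization.

# Minimal sketch of the standard (SQuAD-style) Exact Match metric:
# normalize the prediction and each gold answer, then compare.
import re
import string


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)


# Example: exact_match("The Eiffel Tower", ["Eiffel Tower"]) -> True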

BibTeX

Please cite our paper if you use our code, data, models, or results:


@article{jiang2024longrag,
  title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
  author={Jiang, Ziyan and Ma, Xueguang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2406.15319},
  year={2024},
  url={https://arxiv.org/abs/2406.15319}
}