In the traditional RAG framework, the basic retrieval units are normally short: common retrievers like DPR typically work with 100-word Wikipedia paragraphs.
Such a design forces the retriever to search over a large corpus to find the "needle" unit, while the reader only needs to extract answers from the
short retrieved units. This imbalanced design, with a heavy retriever and a light reader, can lead to sub-optimal performance. To alleviate the imbalance,
we propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader". LongRAG processes the entire Wikipedia corpus into 4K-token units, which are
30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 600K. This greatly lowers the burden on the retriever,
which leads to remarkable retrieval scores: answer recall@1 = 71% on NQ (previously 52%) and answer recall@2 = 72% on HotpotQA (full-wiki) (previously 47%). We then
feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG
achieves an EM of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), which is on par with the SoTA models. Our study offers insights into the future roadmap for combining
RAG with long-context LLMs.
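To make the retrieve-then-read flow concrete, below is a minimal sketch of the retrieval side: long (~4K-token) units are scored against the question with a dense encoder and the top-k units are concatenated into a single reader context. The encoder choice (`BAAI/bge-large-en-v1.5`) and the direct encoding of whole units are illustrative assumptions; the released implementation may score long units differently (e.g., by aggregating chunk-level scores).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example encoder choice; any dense retriever with an `encode` method works.
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def retrieve_long_units(question: str, units: list[str], k: int = 4) -> str:
    """Score ~4K-token retrieval units against the question and
    concatenate the top-k of them into one long reader context."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    unit_vecs = encoder.encode(units, normalize_embeddings=True)
    scores = unit_vecs @ q_vec  # cosine similarity on normalized vectors
    top_idx = np.argsort(-scores)[:k]
    # With 4K-token units, k = 4-8 already yields a 16K-32K-token context.
    return "\n\n".join(units[i] for i in top_idx)
```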
Traditional RAG operates on short retrieval units, where the retriever needs to scan over a massive number of units to find the relevant piece. In contrast, LongRAG operates on long retrieval units (30x longer). The retriever has a much lighter workload, which significantly boosts the recall score, and LongRAG fully exploits the ability of long-context language models (the reader) to achieve strong performance.
On the left, the long retrieval units are formed by grouping Wikipedia documents through hyperlinks. Each retrieval unit contains an average of 4K tokens, corresponding to multiple related documents. On the right, a multi-hop question answering test case from HotpotQA shows that the final answer can be obtained from only a few retrieval units, which are then fed into a long reader.
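The hyperlink-based grouping mentioned above can be approximated greedily: start from a seed document, follow its hyperlinks to related documents, and keep merging until the unit reaches roughly 4K tokens. The sketch below is only an illustrative approximation; the document schema (`text`, `links`), the traversal order, and the whitespace token count are assumptions, not the paper's exact procedure.

```python
def group_documents(docs: dict[str, dict], max_tokens: int = 4096) -> list[str]:
    """Greedily group hyperlink-related Wikipedia documents into long
    retrieval units of roughly `max_tokens` tokens each.

    `docs` maps a title to {"text": str, "links": [linked titles]};
    token counts are approximated by whitespace word counts here.
    """
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    visited, units = set(), []
    for title in docs:
        if title in visited:
            continue
        unit_text, budget = [], max_tokens
        queue = [title]
        # Follow hyperlinks from the seed document, adding neighbours
        # until the token budget for this unit is exhausted.
        while queue and budget > 0:
            cur = queue.pop(0)
            if cur in visited or cur not in docs:
                continue
            visited.add(cur)
            unit_text.append(docs[cur]["text"])
            budget -= n_tokens(docs[cur]["text"])
            queue.extend(docs[cur]["links"])
        units.append("\n\n".join(unit_text))
    return units
```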
| Retrieval Unit | Corpus Size | Num. of Retrieval Units | Avg. Num. of Tokens (Corpus) | Avg. Num. of Tokens (Test Set) | Answer Recall (AR) |
|---|---|---|---|---|---|
| Passage | 22M | 1 | 120 | 130 | 52.24 |
| Passage | 22M | 100 | 12K | 14K | 89.92 |
| Passage | 22M | 200 | 24K | 28K | 91.30 |
| Document | 3M | 1 | 820 | 4K | 69.45 |
| Document | 3M | 5 | 4K | 18K | 85.37 |
| Document | 3M | 10 | 8K | 34K | 88.12 |
| Grouped Documents | 600K | 1 | 4K | 6K | 71.69 |
| Grouped Documents | 600K | 4 | 16K | 25K | 86.30 |
| Grouped Documents | 600K | 8 | 32K | 50K | 88.53 |
Employing a long-context retriever (with retrieval units averaging up to 6K tokens each) compresses the corpus size by up to 30x (from 22M to 600K units) and improves top-1 answer recall by roughly 20 points (from 52.24 to 71.69). Furthermore, long-context retrieval requires significantly fewer retrieval units (about 10x fewer) to achieve comparable results. Integrating long-context retrieval therefore significantly alleviates the burden on the retriever model.
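Answer Recall (AR) in these tables can be read as the fraction of questions for which at least one gold answer string appears in the retrieved units. Below is a minimal sketch of that check, assuming a simple lowercasing/punctuation-stripping normalization rather than the exact evaluation script.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a loose string match."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def answer_recall(retrievals: list[list[str]], answers: list[list[str]]) -> float:
    """Fraction of questions whose retrieved units contain a gold answer string."""
    hits = 0
    for units, golds in zip(retrievals, answers):
        context = normalize(" ".join(units))
        if any(normalize(g) in context for g in golds):
            hits += 1
    return hits / len(answers)
```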
| Retrieval Unit | Corpus Size | Num. of Retrieval Units | Avg. Num. of Tokens (Corpus) | Avg. Num. of Tokens (Test Set) | Recall (R) | Answer Recall (AR) |
|---|---|---|---|---|---|---|
| Document | 5.2M | 2 | 130 | 200 | 30.01 | 47.75 |
| Document | 5.2M | 100 | 6.5K | 10K | 74.84 | 84.67 |
| Document | 5.2M | 200 | 13K | 20K | 79.68 | 88.34 |
| Grouped Documents | 500K | 2 | 1K | 8K | 56.30 | 72.49 |
| Grouped Documents | 500K | 8 | 4K | 29K | 74.71 | 84.40 |
Similar to the findings on NQ, long-context retrieval significantly alleviates the burden on the retriever component within the entire RAG framework.
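For the multi-hop setting, the table adds Recall (R) alongside Answer Recall (AR). One natural reading, used in the sketch below, is that R checks whether the retrieved units cover all gold supporting documents of a question (both hops), while AR only checks for the final answer string as above. This is an illustrative interpretation of the metric, not the official scorer.

```python
def supporting_doc_recall(retrieved_titles: list[set[str]],
                          gold_titles: list[set[str]]) -> float:
    """Fraction of questions whose retrieved units cover *all* gold
    supporting documents (i.e., both hops of a HotpotQA question)."""
    covered = sum(gold <= got for got, gold in zip(retrieved_titles, gold_titles))
    return covered / len(gold_titles)
```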
| Method | Baseline Group | EM (Exact Match) |
|---|---|---|
| GPT-4-Turbo | Closed-Book | 41.2 |
| Gemini-1.5-Pro | Closed-Book | 47.8 |
| Claude-3-Opus | Closed-Book | 49.2 |
| REALM | Fully-Supervised RAG | 40.4 |
| DPR | Fully-Supervised RAG | 41.5 |
| RAG | Fully-Supervised RAG | 44.5 |
| RETRO | Fully-Supervised RAG | 45.5 |
| RePAQ | Fully-Supervised RAG | 47.8 |
| FiD | Fully-Supervised RAG | 51.4 |
| EMDR | Fully-Supervised RAG | 52.5 |
| Atlas | Fully-Supervised RAG | 64.0 |
| REPLUG | No Fine-tuning RAG | 45.5 |
| LongRAG (Gemini-1.5-Pro; Recall 4 units) | No Fine-tuning RAG | 58.6 |
| LongRAG (GPT-4o; Recall 4 units) | No Fine-tuning RAG | 62.7 |
The table shows the QA results on the NQ dataset. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the model is trained on the training data within the RAG framework; and no-fine-tuning RAG, which employs the RAG framework without any tuning.
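The LongRAG rows rely on a reader that needs no training: the top-k long units are concatenated and a long-context LLM is prompted to extract a short answer. Below is a minimal sketch using the OpenAI chat completions API; the prompt wording, the `gpt-4o` model choice, and the single-turn setup are illustrative assumptions rather than the paper's exact prompting recipe.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def long_reader(question: str, long_context: str, model: str = "gpt-4o") -> str:
    """Zero-shot answer extraction from a ~30K-token retrieved context."""
    prompt = (
        "Answer the question based on the retrieved Wikipedia context.\n\n"
        f"Context:\n{long_context}\n\n"
        f"Question: {question}\n"
        "Give a short answer (a few words), with no explanation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```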
| Method | Baseline Group | EM (Exact Match) |
|---|---|---|
| GPT-4-Turbo | Closed-Book | 42.4 |
| Gemini-1.5-Pro | Closed-Book | 33.9 |
| Claude-3-Opus | Closed-Book | 32.8 |
| CogQA | Fully-Supervised RAG | 37.1 |
| DrKIT | Fully-Supervised RAG | 42.1 |
| Transformer-XH | Fully-Supervised RAG | 51.6 |
| QAMAT+ | Fully-Supervised RAG | 57.6 |
| HGN | Fully-Supervised RAG | 59.7 |
| PathRetriever | Fully-Supervised RAG | 60.0 |
| HopRetrieve | Fully-Supervised RAG | 62.1 |
| MDR | Fully-Supervised RAG | 62.3 |
| HopRetrieve-plus | Fully-Supervised RAG | 66.5 |
| AISO | Fully-Supervised RAG | 68.1 |
| COS | Fully-Supervised RAG | 68.2 |
| DSP | No Fine-tuning RAG | 51.4 |
| PromptRank | No Fine-tuning RAG | 55.7 |
| LongRAG (Gemini-1.5-Pro; Recall 8 units) | No Fine-tuning RAG | 57.5 |
| LongRAG (GPT-4o; Recall 8 units) | No Fine-tuning RAG | 64.3 |
The table shows the QA results on the HotpotQA dev set. We compare against the same three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the model is trained on the training data within the RAG framework; and no-fine-tuning RAG, which employs the RAG framework without any tuning.
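The EM (exact match) numbers in both QA tables follow the standard open-domain QA convention: a prediction counts as correct if, after light normalization, it matches one of the gold answers exactly. A sketch of that metric using SQuAD-style normalization (assumed here for illustration):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers)
```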
@article{jiang2024longrag,
  title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
  author={Jiang, Ziyan and Ma, Xueguang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2406.15319},
  year={2024},
  url={https://arxiv.org/abs/2406.15319}
}