In the traditional RAG framework, the basic retrieval units are normally short: common retrievers like DPR typically work with 100-word Wikipedia paragraphs.
Such a design forces the retriever to search over a large corpus to find the "needle" unit, while the reader only needs to extract answers from the
short retrieved units. This imbalanced design, with a heavy retriever and a light reader, can lead to sub-optimal performance. To alleviate the imbalance,
we propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader". LongRAG processes the entire Wikipedia corpus into 4K-token units, which are
30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 600K. This greatly lowers the burden on the retriever,
which leads to remarkable retrieval scores: answer recall@1 = 71% on NQ (previously 52%) and answer recall@2 = 72% on HotpotQA (full-wiki) (previously 47%). We then
feed the top-k retrieved units (≈ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG
achieves an EM of 62.7% on NQ and 64.3% on HotpotQA (full-wiki), which is on par with the SoTA models. Our study offers insights into the future roadmap for combining
RAG with long-context LLMs.
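To make the retrieve-then-read flow concrete, below is a minimal sketch of the retrieval side: long (~4K-token) units are scored against the question with a dense encoder and the top-k units are concatenated into a single reader context. The encoder choice (`BAAI/bge-large-en-v1.5`) and the direct encoding of whole units are illustrative assumptions; the released implementation may score long units differently (e.g., by aggregating chunk-level scores).

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Example encoder choice; any dense retriever with an `encode` method works.
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def retrieve_long_units(question: str, units: list[str], k: int = 4) -> str:
    """Score ~4K-token retrieval units against the question and
    concatenate the top-k of them into one long reader context."""
    q_vec = encoder.encode([question], normalize_embeddings=True)[0]
    unit_vecs = encoder.encode(units, normalize_embeddings=True)
    scores = unit_vecs @ q_vec  # cosine similarity on normalized vectors
    top_idx = np.argsort(-scores)[:k]
    # With 4K-token units, k = 4-8 already yields a 16K-32K-token context.
    return "\n\n".join(units[i] for i in top_idx)
```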
Traditional RAG operates on short retrieval units, where the retriever needs to scan over a massive number of units to find the relevant piece. In contrast, LongRAG operates on long retrieval units (30x longer). The retriever has a much lighter workload, which significantly boosts the recall score, and LongRAG fully exploits the ability of long-context language models (the reader) to achieve strong performance.
On the left, the long retrieval units are formed by grouping Wikipedia documents through hyperlinks. Each retrieval unit contains an average of 4K tokens, corresponding to multiple related documents. On the right, a multi-hop question answering test case from HotpotQA shows that the final answer can be obtained from only a few retrieval units, which are then fed into a long reader.
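The hyperlink-based grouping mentioned above can be approximated greedily: start from a seed document, follow its hyperlinks to related documents, and keep merging until the unit reaches roughly 4K tokens. The sketch below is only an illustrative approximation; the document schema (`text`, `links`), the traversal order, and the whitespace token count are assumptions, not the paper's exact procedure.

```python
def group_documents(docs: dict[str, dict], max_tokens: int = 4096) -> list[str]:
    """Greedily group hyperlink-related Wikipedia documents into long
    retrieval units of roughly `max_tokens` tokens each.

    `docs` maps a title to {"text": str, "links": [linked titles]};
    token counts are approximated by whitespace word counts here.
    """
    def n_tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    visited, units = set(), []
    for title in docs:
        if title in visited:
            continue
        unit_text, budget = [], max_tokens
        queue = [title]
        # Follow hyperlinks from the seed document, adding neighbours
        # until the token budget for this unit is exhausted.
        while queue and budget > 0:
            cur = queue.pop(0)
            if cur in visited or cur not in docs:
                continue
            visited.add(cur)
            unit_text.append(docs[cur]["text"])
            budget -= n_tokens(docs[cur]["text"])
            queue.extend(docs[cur]["links"])
        units.append("\n\n".join(unit_text))
    return units
```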
| Retrieval Unit | Corpus Size | Num. of Retrieval Units | Avg. Num. of Tokens (Corpus) | Avg. Num. of Tokens (Test Set) | Answer Recall (AR) |
|---|---|---|---|---|---|
| Passage | 22M | 1 | 120 | 130 | 52.24 |
| Passage | 22M | 100 | 12K | 14K | 89.92 |
| Passage | 22M | 200 | 24K | 28K | 91.30 |
| Document | 3M | 1 | 820 | 4K | 69.45 |
| Document | 3M | 5 | 4K | 18K | 85.37 |
| Document | 3M | 10 | 8K | 34K | 88.12 |
| Grouped Documents | 600K | 1 | 4K | 6K | 71.69 |
| Grouped Documents | 600K | 4 | 16K | 25K | 86.30 |
| Grouped Documents | 600K | 8 | 32K | 50K | 88.53 |
Employing a long-context retriever (with retrieval units averaging up to 6K tokens each) compresses the corpus size by up to 30x (from 22M to 600K units) and improves top-1 answer recall by roughly 20 points (from 52.24 to 71.69). Furthermore, long-context retrieval requires significantly fewer retrieval units (about 10x fewer) to achieve comparable results. Integrating long-context retrieval therefore significantly alleviates the burden on the retriever model.
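Answer Recall (AR) in these tables can be read as the fraction of questions for which at least one gold answer string appears in the retrieved units. Below is a minimal sketch of that check, assuming a simple lowercasing/punctuation-stripping normalization rather than the exact evaluation script.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace for a loose string match."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def answer_recall(retrievals: list[list[str]], answers: list[list[str]]) -> float:
    """Fraction of questions whose retrieved units contain a gold answer string."""
    hits = 0
    for units, golds in zip(retrievals, answers):
        context = normalize(" ".join(units))
        if any(normalize(g) in context for g in golds):
            hits += 1
    return hits / len(answers)
```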
| Retrieval Unit | Corpus Size | Num. of Retrieval Units | Avg. Num. of Tokens (Corpus) | Avg. Num. of Tokens (Test Set) | Recall (R) | Answer Recall (AR) |
|---|---|---|---|---|---|---|
| Document | 5.2M | 2 | 130 | 200 | 30.01 | 47.75 |
| Document | 5.2M | 100 | 6.5K | 10K | 74.84 | 84.67 |
| Document | 5.2M | 200 | 13K | 20K | 79.68 | 88.34 |
| Grouped Documents | 500K | 2 | 1K | 8K | 56.30 | 72.49 |
| Grouped Documents | 500K | 8 | 4K | 29K | 74.71 | 84.40 |
Similar to the findings on NQ, long-context retrieval significantly alleviates the burden on the retriever component within the entire RAG framework.
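For the multi-hop setting, the table adds Recall (R) alongside Answer Recall (AR). One natural reading, used in the sketch below, is that R checks whether the retrieved units cover all gold supporting documents of a question (both hops), while AR only checks for the final answer string as above. This is an illustrative interpretation of the metric, not the official scorer.

```python
def supporting_doc_recall(retrieved_titles: list[set[str]],
                          gold_titles: list[set[str]]) -> float:
    """Fraction of questions whose retrieved units cover *all* gold
    supporting documents (i.e., both hops of a HotpotQA question)."""
    covered = sum(gold <= got for got, gold in zip(retrieved_titles, gold_titles))
    return covered / len(gold_titles)
```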
| Method | Baseline Group | EM (Exact Match) |
|---|---|---|
| GPT-4-Turbo | Closed-Book | 41.2 |
| Gemini-1.5-Pro | Closed-Book | 47.8 |
| Claude-3-Opus | Closed-Book | 49.2 |
| REALM | Fully-Supervised RAG | 40.4 |
| DPR | Fully-Supervised RAG | 41.5 |
| RAG | Fully-Supervised RAG | 44.5 |
| RETRO | Fully-Supervised RAG | 45.5 |
| RePAQ | Fully-Supervised RAG | 47.8 |
| FiD | Fully-Supervised RAG | 51.4 |
| EMDR | Fully-Supervised RAG | 52.5 |
| Atlas | Fully-Supervised RAG | 64.0 |
| REPLUG | No Fine-tuning RAG | 45.5 |
| LongRAG (Gemini-1.5-Pro; Recall 4 units) | No Fine-tuning RAG | 58.6 |
| LongRAG (GPT-4o; Recall 4 units) | No Fine-tuning RAG | 62.7 |
The table shows the QA results on the NQ dataset. We compare against three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the model is trained on the training data within the RAG framework; and no-fine-tuning RAG, which employs the RAG framework without any tuning.
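The LongRAG rows rely on a reader that needs no training: the top-k long units are concatenated and a long-context LLM is prompted to extract a short answer. Below is a minimal sketch using the OpenAI chat completions API; the prompt wording, the `gpt-4o` model choice, and the single-turn setup are illustrative assumptions rather than the paper's exact prompting recipe.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def long_reader(question: str, long_context: str, model: str = "gpt-4o") -> str:
    """Zero-shot answer extraction from a ~30K-token retrieved context."""
    prompt = (
        "Answer the question based on the retrieved Wikipedia context.\n\n"
        f"Context:\n{long_context}\n\n"
        f"Question: {question}\n"
        "Give a short answer (a few words), with no explanation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```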
| Method | Baseline Group | EM (Exact Match) |
|---|---|---|
| GPT-4-Turbo | Closed-Book | 42.4 |
| Gemini-1.5-Pro | Closed-Book | 33.9 |
| Claude-3-Opus | Closed-Book | 32.8 |
| CogQA | Fully-Supervised RAG | 37.1 |
| DrKIT | Fully-Supervised RAG | 42.1 |
| Transformer-XH | Fully-Supervised RAG | 51.6 |
| QAMAT+ | Fully-Supervised RAG | 57.6 |
| HGN | Fully-Supervised RAG | 59.7 |
| PathRetriever | Fully-Supervised RAG | 60.0 |
| HopRetrieve | Fully-Supervised RAG | 62.1 |
| MDR | Fully-Supervised RAG | 62.3 |
| HopRetrieve-plus | Fully-Supervised RAG | 66.5 |
| AISO | Fully-Supervised RAG | 68.1 |
| COS | Fully-Supervised RAG | 68.2 |
| DSP | No Fine-tuning RAG | 51.4 |
| PromptRank | No Fine-tuning RAG | 55.7 |
| LongRAG (Gemini-1.5-Pro; Recall 8 units) | No Fine-tuning RAG | 57.5 |
| LongRAG (GPT-4o; Recall 8 units) | No Fine-tuning RAG | 64.3 |
The table shows the QA results on the HotpotQA dev set. We compare against the same three groups of baselines: closed-book, which directly prompts state-of-the-art LLMs with 16-shot in-context examples; fully-supervised RAG, where the model is trained on the training data within the RAG framework; and no-fine-tuning RAG, which employs the RAG framework without any tuning.
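The EM (exact match) numbers in both QA tables follow the standard open-domain QA convention: a prediction counts as correct if, after light normalization, it matches one of the gold answers exactly. A sketch of that metric using SQuAD-style normalization (assumed here for illustration):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers)
```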
@article{jiang2024longrag,
  title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
  author={Jiang, Ziyan and Ma, Xueguang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2406.15319},
  year={2024},
  url={https://arxiv.org/abs/2406.15319}
}