Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new perspective on building better instruction tuning data.
Figure 1: Overview of MAmmoTH2-Plus results. The MAmmoTH2-8x7B-Plus variant outperforms Mixtral-Instruct on reasoning benchmarks, matching Qwen-1.5-110B with only 13% active parameters. It also surpasses Mixtral-Instruct by around 10 points on general code and chatbot benchmarks.
In this section, we outline the process of constructing WebInstruct from the web corpus. Specifically, we divide the data collection pipeline into three stages: (1) high-quality data recall from the web corpus, (2) Q-A pair extraction, and (3) Q-A pair refinement. We depict the full pipeline in Figure 3 and provide an example of extraction and refinement in Figure 4.
To ensure diversity in our training data across various disciplines like math, science, and engineering, we propose crawling exam problems from educational websites such as stemez.com, homeworkstudy.com, and khanacademy.org. We collected 100K diverse seed examples and randomly selected 100K negative documents from Common Crawl (CC) for training a fastText model. Using the open-source fastText library with 256-dimensional vectors, we trained the model for 3 epochs with a learning rate of 0.1, allowing n-grams up to length 3, and capping word occurrences at 3.
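A minimal sketch of this classifier-training step, assuming the open-source `fasttext` Python bindings and a labeled text file built from the 100K seed documents and 100K negative Common Crawl documents (the file name, label scheme, and the mapping of the word-occurrence setting to `minCount` are our assumptions, not the authors' released code):

```python
import fasttext

# Train the recall classifier with the hyperparameters described above.
# Each line of the input file is assumed to look like:
#   __label__pos <document text>   or   __label__neg <document text>
model = fasttext.train_supervised(
    input="webinstruct_seed_train.txt",  # hypothetical path to labeled seed/negative docs
    dim=256,        # 256-dimensional vectors
    epoch=3,        # 3 training epochs
    lr=0.1,         # learning rate 0.1
    wordNgrams=3,   # n-grams up to length 3
    minCount=3,     # our reading of the word-occurrence threshold (assumption)
)
model.save_model("recall_fasttext.bin")

# Score a candidate Common Crawl document; keep it if the positive label wins.
labels, probs = model.predict("Prove that the sum of two even integers is even.")
print(labels[0], probs[0])
```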
In the initial stage, the trained fastText model recalls the top 100B documents from CC, categorizing them by domain (root URL). We employ GPT-4 to identify domains likely to contain instructional content, achieving satisfactory results through in-context learning. Subsequently, we sample additional documents from these selected domains as positive examples and use documents from non-selected domains and the general CC as negative examples to refine the fastText classifier. The updated classifier then recalls the top 18M documents for further processing.
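The domain-level bookkeeping between the two recall passes can be sketched as below, assuming recalled documents arrive as (url, text) pairs; the function names and data structures are illustrative only, and the GPT-4 in-context classification of domains is not shown:

```python
from collections import defaultdict
from urllib.parse import urlparse

def root_domain(url: str) -> str:
    """Extract the root host, e.g. https://stemez.com/a/b -> stemez.com."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

def group_by_domain(docs):
    """docs: iterable of (url, text) pairs recalled by the fastText model."""
    buckets = defaultdict(list)
    for url, text in docs:
        buckets[root_domain(url)].append(text)
    # Rank domains by document count; the top domains are then judged
    # (e.g. by GPT-4 with in-context examples) as instructional or not,
    # and their documents become positives for refining the classifier.
    return sorted(buckets.items(), key=lambda kv: len(kv[1]), reverse=True)
```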
Figure 5: The distribution of the top 25 URLs in our instruction dataset.
Recalled documents contain diverse content from forums, homework sites, quizzes, and exams. Despite noise such as ads and HTML markup, they contain valuable Q-A pairs. To extract useful content, we preprocess the documents by parsing the HTML and removing unrelated information. We then use Mixtral-8×7B to identify Q-A pairs, resulting in 5M candidates. We also filter out web pages containing questions or answers from our evaluation benchmarks to avoid contamination. However, many extracted answers lack explanations, so we refine the candidates further in the next step.
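One common way to implement the contamination filter is n-gram overlap against the benchmark test sets; the sketch below follows that recipe under our own assumptions (the 13-gram window and function names are not values reported in the paper):

```python
import re

def ngrams(text: str, n: int = 13):
    """Set of lowercase word n-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_benchmark_index(benchmark_items, n: int = 13):
    """benchmark_items: strings of benchmark questions/answers (e.g. MATH, GSM8K)."""
    index = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(page_text: str, index, n: int = 13) -> bool:
    """Drop a page if it shares any long n-gram with a benchmark item."""
    return not ngrams(page_text, n).isdisjoint(index)
```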
To further improve the extracted Q-A pair candidates, we refine them with LLMs. In this step, we prompt Mixtral-8×7B and Qwen-72B to reformat the extracted Q-A pairs. If an answer does not contain any explanation, we prompt the LLMs to complete the intermediate reasoning steps leading to the answer. We adopt two models to increase diversity. Eventually, we harvest 10M Q-A pairs as our final instruction-tuning dataset, WebInstruct.
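An illustrative prompt template for this refinement step is shown below; the wording is our own paraphrase of the described behavior, not the exact prompt sent to Mixtral-8×7B or Qwen-72B:

```python
# Hypothetical refinement prompt; the model call itself is omitted.
REFINE_PROMPT = """You are given a question-answer pair extracted from a web page.
Rewrite it into a clean instruction-response pair.
If the answer lacks an explanation, add the intermediate reasoning steps
that lead to the final answer. Do not change the final answer.

Question:
{question}

Answer:
{answer}

Refined question and step-by-step solution:"""

def build_refine_prompt(question: str, answer: str) -> str:
    """Fill the template for one extracted Q-A candidate."""
    return REFINE_PROMPT.format(question=question, answer=answer)
```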
In Table 1, we compare existing datasets with WebInstruct. Most SFT datasets contain fewer than 1M samples of very high quality. XwinMath and OpenMathInstruct are the largest, surpassing 1M samples through GPT-4/Mixtral synthesis. However, they suffer from narrow domain coverage, being mainly based on GSM8K and MATH; this leads to over-fitting to these benchmarks, as shown in Table 2. On the other hand, continued-training (CT) datasets, often sourced from the web, are much larger, exceeding 10B tokens and even reaching 120B tokens, but they are costlier to train on and have a higher noise ratio. WebInstruct strikes a balance between scalability and quality: it approaches the scale of CT datasets while maintaining high quality through our three-step construction pipeline, which sets it apart from existing alternatives.
Dataset | #Pairs | Domain | Format | Dataset Source |
---|---|---|---|---|
FLAN V2 | 100K | General | SFT | NLP data + Human CoT |
Self-Instruct | 82K | General | SFT | Generated by GPT3 |
GPT4-Alpaca | 52K | General | SFT | Generated by GPT4 |
SuperNI | 96K | General | SFT | NLP Datasets |
Tora | 16K | Math | SFT | GPT4 GSM+MATH Synthesis |
WizardMath | 96K | Math | SFT | GPT4 GSM+MATH Synthesis |
MathInstruct | 262K | Math | SFT | GPT4 Math datasets Synthesis |
MetaMathQA | 395K | Math | SFT | GPT-3.5-Turbo GSM+MATH Synthesis |
XwinMath | 1.4M | Math | SFT | GPT4 GSM+MATH Synthesis |
OpenMathInstruct | 1.8M | Math | SFT | Mixtral GSM+MATH Synthesis |
Dataset | #Tokens | Domain | Format | Dataset Source |
---|---|---|---|---|
OpenWebMath | 12B | Math | LM | Filtered from Web |
MathPile | 10B | Math | LM | Filtered from Web |
Cosmopedia | 25B | General | LM | Synthesized by Mixtral |
MINERVA | 38B | Math | LM | Filtered from Web |
Proof-Pile-2 | 55B | Math | LM | OWM+Arxiv+Code |
Galactica | 106B | Math & Sci. | LM | Filtered from Web |
DeepseekMath | 120B | Math | LM | Recalled from Web |
WebInstruct | 5B (10M pairs) | Math & Sci. | SFT | Recalled and Extracted from Web |
Model | TheoremQA | MATH | GSM8K | GPQA | MMLU-ST | BBH | ARC-C | Avg |
---|---|---|---|---|---|---|---|---|
GPT-4-Turbo-0409 | 48.4 | 69.2 | 94.5 | 46.2 | 76.5 | 86.7 | 93.6 | 73.6 |
Parameter Size between 20B and 110B | | | | | | | | |
Qwen-1.5-110B | 34.9 | 49.6 | 85.4 | 35.9 | 73.4 | 74.8 | 91.6 | 63.6 |
Qwen-1.5-72B | 29.3 | 46.8 | 77.6 | 36.3 | 68.5 | 68.0 | 92.2 | 59.8 |
Deepseek-LM-67B | 25.3 | 15.9 | 66.5 | 31.8 | 57.4 | 71.7 | 86.8 | 50.7 |
Yi-34B | 23.2 | 15.9 | 67.9 | 29.7 | 62.6 | 66.4 | 89.5 | 50.7 |
Llemma-34B | 21.1 | 25.0 | 71.9 | 29.2 | 54.7 | 48.4 | 69.5 | 45.7 |
Mixtral-8×7B | 23.2 | 28.4 | 74.4 | 29.7 | 59.7 | 66.8 | 84.7 | 52.4 |
Mixtral-8×7B-Instruct | 25.3 | 22.1 | 71.7 | 32.4 | 61.4 | 57.3 | 84.7 | 50.7 |
Intern-Math-20B | 17.1 | 37.7 | 82.9 | 28.9 | 50.1 | 39.3 | 68.6 | 46.4 |
Trained only with WebInstruct (All evaluations are held-out) | | | | | | | | |
MAmmoTH2-34B | 30.4 | 35.0 | 75.6 | 31.8 | 64.5 | 68.0 | 90.0 | 56.4 |
MAmmoTH2-8x7B | 32.2 | 39.0 | 75.4 | 36.8 | 67.4 | 71.1 | 87.5 | 58.9 |
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K) | | | | | | | | |
MAmmoTH2-8x7B-Plus | 34.1 | 47.0 | 86.4 | 37.8 | 72.4 | 74.1 | 88.4 | 62.9 |
Parameter Size = 7B or 8B | | | | | | | | |
Deepseek-7B | 15.7 | 6.4 | 17.4 | 25.7 | 43.1 | 42.8 | 47.8 | 28.4 |
Qwen-1.5-7B | 14.2 | 13.3 | 54.1 | 26.7 | 45.4 | 45.2 | 75.6 | 39.2 |
Mistral-7B | 19.2 | 11.2 | 36.2 | 24.7 | 50.1 | 55.7 | 74.2 | 38.8 |
Gemma-7B | 21.5 | 24.3 | 46.4 | 25.7 | 53.3 | 57.4 | 72.5 | 43.0 |
Llemma-7B | 17.2 | 18.0 | 36.4 | 23.2 | 45.2 | 44.9 | 50.5 | 33.6 |
WizardMath-7B-1.1 | 11.7 | 33.0 | 83.2 | 28.7 | 52.7 | 56.7 | 76.9 | 49.0 |
OpenMath-Mistral | 13.1 | 9.1 | 24.5 | 26.5 | 43.7 | 49.5 | 69.4 | 33.7 |
Abel-7B-002 | 19.3 | 29.5 | 83.2 | 30.3 | 29.7 | 32.7 | 72.5 | 42.5 |
Intern-Math-7B | 13.2 | 34.6 | 78.1 | 22.7 | 41.1 | 48.1 | 59.8 | 42.5 |
Rho-1-Math-7B | 21.0 | 31.0 | 66.9 | 29.2 | 53.1 | 57.7 | 72.7 | 47.3 |
Deepseek-Math-7B | 25.3 | 34.0 | 64.2 | 29.2 | 56.4 | 59.5 | 67.8 | 48.0 |
Deepseek-Math-Instruct | 23.7 | 44.3 | 82.9 | 31.8 | 59.3 | 55.4 | 70.1 | 52.5 |
Llama-3-8B | 20.1 | 21.3 | 54.8 | 27.2 | 55.6 | 61.1 | 78.6 | 45.5 |
Llama-3-8B-Instruct | 22.8 | 30.0 | 79.5 | 34.5 | 60.2 | 66.0 | 80.8 | 53.4 |
Trained only with WebInstruct (All evaluations are held-out) | | | | | | | | |
MAmmoTH2-7B | 29.0 | 36.7 | 68.4 | 32.4 | 62.4 | 58.6 | 81.7 | 52.8 |
MAmmoTH2-8B | 32.2 | 35.8 | 70.4 | 35.2 | 64.2 | 62.1 | 82.2 | 54.3 |
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K) | | | | | | | | |
MAmmoTH2-7B-Plus | 29.2 | 45.0 | 84.7 | 36.8 | 64.5 | 63.1 | 83.0 | 58.0 |
MAmmoTH2-8B-Plus | 32.5 | 42.8 | 84.1 | 37.3 | 65.7 | 67.8 | 83.4 | 59.1 |
Model | HumanEval | HumanEval+ | MBPP | MBPP+ | Average | Average+ |
---|---|---|---|---|---|---|
Mistral-7B | 28.7 | 23.8 | 51.9 | 42.1 | 40.3 | 33.0 |
Gemma-7B | 26.8 | 20.1 | 52.6 | 43.4 | 39.7 | 31.8 |
Llama-3-8B | 33.5 | 29.3 | 61.4 | 51.6 | 47.5 | 40.5 |
Gemma-1.1-7B-Instruct | 42.7 | 35.4 | 57.1 | 45.0 | 49.9 | 40.2 |
Mistral-7B-Instruct-v0.2 | 75.0 | 70.1 | 44.7 | 37.0 | 59.9 | 53.6 |
Llama-3-8B-Instruct | 61.6 | 56.7 | 70.1 | 59.3 | 65.9 | 58.0 |
Mixtral-8×7B-Instruct-v0.1 | 45.1 | 39.6 | 59.5 | 49.7 | 52.3 | 44.7 |
MAmmoTH2-7B-Plus | 72.1 | 65.9 | 60.1 | 50.4 | 66.1 | 58.2 |
MAmmoTH2-8B-Plus | 63.4 | 57.9 | 60.4 | 48.6 | 61.9 | 53.3 |
MAmmoTH2-8x7B-Plus | 57.9 | 53.7 | 68.7 | 56.9 | 63.3 | 55.3 |
Model | MT-Bench | AlpacaEval 2.0 | Arena Hard | MMLU |
---|---|---|---|---|
GPT-4-1106-preview | 9.32 | 50.0 | - | - |
GPT-3.5-Turbo-1106 | 8.32 | 19.3 | 18.9 | - |
GPT-3.5-Turbo-0301 | 7.94 | 18.1 | 18.1 | 70.0 |
Tulu-2-DPO-70B | 7.89 | 21.2 | 15.0 | 67.8 |
Llama-2-70b-chat | 6.86 | 14.7 | 11.6 | 63.0 |
Yi-34B-Chat | 7.86 | 27.2 | 23.1 | 73.5 |
Gemma-1.1-7B-Instruct | - | 10.4 | 7.5 | 64.3 |
Mistral-7B-Instruct-v0.2 | 7.60 | 17.1 | 12.6 | 60.8 |
Llama-3-8B-Instruct | 8.02 | 22.9 | 20.6 | 67.2 |
Mixtral-8×7B-Instruct-v0.1 | 8.30 | 23.7 | 23.4 | 70.6 |
MAmmoTH2-7B-Plus | 7.88 | 23.4 | 14.6 | 63.3 |
MAmmoTH2-8B-Plus | 7.95 | 18.5 | 16.6 | 64.6 |
MAmmoTH2-8x7B-Plus | 8.20 | 33.8 | 32.6 | 68.3 |