🦣 MAmmoTH2:
Scaling Instructions from the Web

Carnegie Mellon University, University of Waterloo
xyue2@andrew.cmu.edu , wenhuchen@uwaterloo.ca

Abstract

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B’s (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new perspective on building better instruction tuning data.


Figure 1: Overview of MAmmoTH2-Plus results. The MAmmoTH2-8x7B-Plus variant outperforms Mixtral-Instruct on reasoning benchmarks, matching Qwen-1.5-110B with only 13% active parameters. It also surpasses Mixtral-Instruct by around 10 points on general code and chatbot benchmarks.

Introduction

Reasoning is crucial for problem-solving and advancing knowledge. While large language models (LLMs) have made strides in natural language processing (NLP), their ability to perform complex reasoning tasks remains limited. Efforts to enhance LLM reasoning include continued training on filtered documents and instruction tuning with a supervised fine-tuning loss. However, existing instruction tuning datasets are often limited in scale and biased in coverage, prompting the need for scalable and diverse instruction data.

To address this, we propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning various domains like math and science. Our three-step pipeline involves recalling documents from Common Crawl, extracting Q-A pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets.

To validate our dataset's effectiveness, we fine-tuned several base models on it; the resulting models significantly outperform their base counterparts on seven reasoning benchmarks. Further tuning on open-source instruction datasets enhances performance further, showcasing strong generalization. Our approach is also far more cost-effective than human-annotated datasets, offering promising directions for future instruction tuning studies.

Figure 2: Comparison between our dataset curation method and previous studies.

WebInstruct

In this section, we outline the process of constructing WebInstruct from the web corpus. Specifically, we divide the data collection pipeline into three stages: (1) high-quality data recall from the web corpus, (2) Q-A pair extraction, and (3) Q-A pair refinement. We depict the full pipeline in Figure 3 and provide an example of extraction and refinement in Figure 4.


Figure 3: Step 1: Recall relevant documents from Common Crawl. Step 2: Extract Q-A pairs. Step 3: Refine the extracted Q-A pairs.

Recall from Common Crawl

To ensure diversity in our training data across disciplines like math, science, and engineering, we crawl exam problems from educational websites such as stemez.com, homeworkstudy.com, and khanacademy.org as seed data. We collected 100K diverse seed examples and randomly sampled 100K negative documents from Common Crawl (CC) to train a fastText model. Using the open-source fastText library, we trained the model with 256-dimensional vectors for 3 epochs with a learning rate of 0.1, word n-grams up to length 3, and a minimum word occurrence of 3.
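
These hyperparameters map directly onto the supervised training API of the open-source fastText library. Below is a minimal sketch, assuming the seed and negative documents have been written to a fastText-formatted training file; the file names, label scheme, and example query are illustrative, not from the paper.

```python
# Minimal sketch of the stage-1 fastText recall model.
import fasttext

# train.txt: one document per line, prefixed with __label__positive (seed exam
# problems from educational sites) or __label__negative (random CC documents).
model = fasttext.train_supervised(
    input="train.txt",
    dim=256,        # 256-dimensional vectors
    epoch=3,        # 3 training epochs
    lr=0.1,         # learning rate 0.1
    wordNgrams=3,   # word n-grams up to length 3
    minCount=3,     # ignore words occurring fewer than 3 times
)
model.save_model("recall_stage1.bin")

# Score a candidate Common Crawl document (single line, no newlines).
labels, probs = model.predict("Solve for x: 2x + 3 = 11 ...", k=1)
print(labels[0], probs[0])
```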

In the initial stage, the trained fastText model recalls the top 100B documents from CC, categorizing them by domain (root URL). We employ GPT-4 to identify domains likely to contain instructional content, achieving satisfactory results through in-context learning. Subsequently, we sample additional documents from these selected domains as positive examples and use documents from non-selected domains and the general CC as negative examples to refine the fastText classifier. The updated classifier then recalls the top 18M documents for further processing.
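
A rough sketch of this recall-and-screen loop is shown below, under our own assumptions: the stage-1 classifier from the previous snippet, a toy `cc_documents` list standing in for a Common Crawl stream, and an illustrative 0.5 score threshold. Only the domain ranking is shown; the GPT-4 in-context screening itself is omitted.

```python
# Score CC documents with the trained fastText classifier, group them by root
# URL, and rank domains so the top ones can be screened as instructional or not.
from collections import defaultdict
from urllib.parse import urlparse

import fasttext

model = fasttext.load_model("recall_stage1.bin")

# Toy stand-in for a stream of (url, text) pairs read from Common Crawl dumps.
cc_documents = [
    ("https://www.stemez.com/physics/q1", "A ball is thrown upward at 20 m/s. How high does it rise?"),
    ("https://example.com/blog/post", "Ten tips for better sleep and productivity."),
]

def score(text: str) -> float:
    """Probability that a document looks like instructional content."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return float(probs[0]) if labels[0] == "__label__positive" else 1.0 - float(probs[0])

domain_docs = defaultdict(list)  # root URL -> list of (score, url)
for url, text in cc_documents:
    domain_docs[urlparse(url).netloc].append((score(text), url))

# Domains with many high-scoring documents are candidates for in-context
# screening; documents from approved domains become new positives for the
# refined classifier.
ranked_domains = sorted(domain_docs.items(),
                        key=lambda kv: sum(s > 0.5 for s, _ in kv[1]),
                        reverse=True)
```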


Figure 5: The distribution of the top 25 URLs in our instruction dataset.

Q-A Pair Extraction

The recalled documents contain diverse content from forums, homework sites, quizzes, and exams. Despite noise such as ads and HTML markup, they contain valuable Q-A pairs. To extract useful content, we first parse the HTML to remove unrelated information. We then use Mixtral-8×7B to identify Q-A pairs, resulting in 5M candidates. However, many candidates lack explanations, so we refine them further in the next step. We also filter out web pages containing questions or answers from our evaluation benchmarks to avoid contamination.
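
The extraction step can be pictured as a two-stage function: strip the HTML down to visible text, then prompt the extractor LLM. The sketch below is our own illustration; the prompt wording, the character truncation, and the assumption that Mixtral-8×7B is served behind an OpenAI-compatible endpoint (e.g., via vLLM) are not from the paper.

```python
# Sketch of Q-A pair extraction: clean HTML, then ask the LLM for JSON pairs.
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

EXTRACTION_PROMPT = """Below is a web page. Extract every self-contained
question-answer pair you can find. Return them as JSON: a list of objects
with "question" and "answer" fields. If there are none, return [].

Web page:
{page_text}"""

def extract_qa_pairs(html: str) -> str:
    # Drop scripts, styles, and tags; keep only the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    page_text = soup.get_text(separator="\n", strip=True)
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(page_text=page_text[:12000])}],
        temperature=0.0,
    )
    return response.choices[0].message.content  # JSON string of candidate pairs
```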

Q-A Pair Refinement

To further improve the extracted Q-A pair candidates, we refine them with LLMs. In this step, we prompt Mixtral-8×7B and Qwen-72B to reformat the extracted Q-A pairs. If an answer does not contain any explanation, we prompt the LLMs to complete the intermediate reasoning steps leading to the answer. We adopt two models to increase diversity. Eventually, we harvest 10M Q-A pairs as our final instruction-tuning dataset, WebInstruct.
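
A similarly hedged sketch of the refinement step is shown below, alternating between the two refiner models across examples; the model identifiers, routing heuristic, prompt, and temperature are illustrative assumptions layered on the paper's description.

```python
# Sketch of Q-A pair refinement: reformat each pair and fill in missing reasoning.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
REFINERS = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "Qwen/Qwen-72B-Chat"]

REFINE_PROMPT = """Reformat the following question and answer into clean,
self-contained text. If the answer states a result without explanation,
add the intermediate reasoning steps that lead to it. Do not change the
final answer.

Question: {question}
Answer: {answer}"""

def refine(question: str, answer: str, idx: int) -> str:
    # Alternate models across examples to increase stylistic diversity.
    model = REFINERS[idx % len(REFINERS)]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REFINE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```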


Figure 4: An illustrative example from WebInstruct of the extraction and refinement steps.

Dataset Statistics

Table 1 compares WebInstruct with existing datasets. Most SFT datasets contain fewer than 1M samples of very high quality. XwinMath and OpenMathInstruct are the largest, surpassing 1M samples through GPT-4/Mixtral synthesis; however, they suffer from narrow domain coverage, being based mainly on GSM8K and MATH. This results in overfitting to these benchmarks, as shown in Table 2. On the other hand, CT datasets, often sourced from the web, are much larger, exceeding 10B tokens and even reaching 120B tokens, but they are costlier to train on and have a higher noise ratio. WebInstruct strikes a balance between scalability and quality, approaching the scale of CT datasets while maintaining high quality through its three-step construction pipeline. This sets it apart from the alternatives.

Dataset #Pairs Domain Format Dataset Source
FLAN V2 100K General SFT NLP data + Human CoT
Self-Instruct 82K General SFT Generated by GPT3
GPT4-Alpaca 52K General SFT Generated by GPT4
SuperNI 96K General SFT NLP Datasets
ToRA 16K Math SFT GPT4 GSM+MATH Synthesis
WizardMath 96K Math SFT GPT4 GSM+MATH Synthesis
MathInstruct 262K Math SFT GPT4 Math datasets Synthesis
MetaMathQA 395K Math SFT GPT-3.5-Turbo GSM+MATH Synthesis
XwinMath 1.4M Math SFT GPT4 GSM+MATH Synthesis
OpenMathInstruct 1.8M Math SFT Mixtral GSM+MATH Synthesis
Dataset #Tokens Domain Format Dataset Source
OpenWebMath 12B Math LM Filtered from Web
MathPile 10B Math LM Filtered from Web
Cosmopedia 25B General LM Synthesized by Mixtral
MINERVA 38B Math LM Filtered from Web
Proof-Pile-2 55B Math LM OWM+Arxiv+Code
Galactica 106B Math & Sci. LM Filtered from Web
DeepseekMath 120B Math LM Recalled from Web
WebInstruct (10M) 5B Math & Sci. SFT Recalled and Extracted from Web
Table 1: The list of existing supervised fine-tuning (SFT) and continued-training (CT) datasets. The SFT datasets are mostly derived from NLP datasets or completely synthesized by GPT-4. The CT datasets are much larger because they are filtered or recalled from the web, but their content contains a lot of noise. WebInstruct is the first dataset to combine the two approaches to build a high-quality yet large-scale SFT dataset.

Model TheoremQA MATH GSM8K GPQA MMLU-ST BBH ARC-C Avg
GPT-4-Turbo-0409 48.4 69.2 94.5 46.2 76.5 86.7 93.6 73.6
Parameter Size between 20B and 110B
Qwen-1.5-110B 34.9 49.6 85.4 35.9 73.4 74.8 91.6 63.6
Qwen-1.5-72B 29.3 46.8 77.6 36.3 68.5 68.0 92.2 59.8
Deepseek-LM-67B 25.3 15.9 66.5 31.8 57.4 71.7 86.8 50.7
Yi-34B 23.2 15.9 67.9 29.7 62.6 66.4 89.5 50.7
Llemma-34B 21.1 25.0 71.9 29.2 54.7 48.4 69.5 45.7
Mixtral-8×7B 23.2 28.4 74.4 29.7 59.7 66.8 84.7 52.4
Mixtral-8×7B-Instruct 25.3 22.1 71.7 32.4 61.4 57.3 84.7 50.7
Intern-Math-20B 17.1 37.7 82.9 28.9 50.1 39.3 68.6 46.4
Trained only with WebInstruct (All evaluations are held-out)
MAmmoTH2-34B 30.4 35.0 75.6 31.8 64.5 68.0 90.0 56.4
MAmmoTH2-8x7B 32.2 39.0 75.4 36.8 67.4 71.1 87.5 58.9
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K)
MAmmoTH2-8x7B-Plus 34.1 47.0 86.4 37.8 72.4 74.1 88.4 62.9
Parameter Size = 7B or 8B
Deepseek-7B 15.7 6.4 17.4 25.7 43.1 42.8 47.8 28.4
Qwen-1.5-7B 14.2 13.3 54.1 26.7 45.4 45.2 75.6 39.2
Mistral-7B 19.2 11.2 36.2 24.7 50.1 55.7 74.2 38.8
Gemma-7B 21.5 24.3 46.4 25.7 53.3 57.4 72.5 43.0
Llemma-7B 17.2 18.0 36.4 23.2 45.2 44.9 50.5 33.6
WizardMath-7B-1.1 11.7 33.0 83.2 28.7 52.7 56.7 76.9 49.0
OpenMath-Mistral 13.1 9.1 24.5 26.5 43.7 49.5 69.4 33.7
Abel-7B-002 19.3 29.5 83.2 30.3 29.7 32.7 72.5 42.5
Intern-Math-7B 13.2 34.6 78.1 22.7 41.1 48.1 59.8 42.5
Rho-1-Math-7B 21.0 31.0 66.9 29.2 53.1 57.7 72.7 47.3
Deepseek-Math-7B 25.3 34.0 64.2 29.2 56.4 59.5 67.8 48.0
Deepseek-Math-Instruct 23.7 44.3 82.9 31.8 59.3 55.4 70.1 52.5
Llama-3-8B 20.1 21.3 54.8 27.2 55.6 61.1 78.6 45.5
Llama-3-8B-Instruct 22.8 30.0 79.5 34.5 60.2 66.0 80.8 53.4
Trained only with WebInstruct (All evaluations are held-out)
MAmmoTH2-7B 29.0 36.7 68.4 32.4 62.4 58.6 81.7 52.8
MAmmoTH2-8B 32.2 35.8 70.4 35.2 64.2 62.1 82.2 54.3
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K)
MAmmoTH2-7B-Plus 29.2 45.0 84.7 36.8 64.5 63.1 83.0 58.0
MAmmoTH2-8B-Plus 32.5 42.8 84.1 37.3 65.7 67.8 83.4 59.1
Table 2: Our main results on various science reasoning datasets. All models without ‘-Instruct’ refer to the released base models before instruction tuning. For experimental results reported by the official paper or OpenCompass, we take the reported numbers; otherwise, we evaluate with our own script. Underlined results are the best baseline scores under the size constraint.

Figure 6: Mistral-7B performance improves as the number of instructions scales up. Additionally, SFT loss is a more effective training objective than LM loss.
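
To make the caption's distinction concrete, here is a minimal sketch (our interpretation, not the paper's training code) of how the two objectives differ: both are next-token cross-entropy, but SFT loss masks out the instruction tokens so that only the response contributes to the gradient, whereas LM loss is computed over every token.

```python
# Sketch of SFT loss vs. LM loss for a single (instruction, response) sequence.
import torch
import torch.nn.functional as F

def token_loss(logits: torch.Tensor, input_ids: torch.Tensor,
               prompt_len: int, sft: bool = True) -> torch.Tensor:
    """logits: (seq, vocab); input_ids: (seq,); prompt_len: #tokens in the question."""
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    if sft:
        # Mask positions whose target token still belongs to the prompt,
        # so the loss covers only the response tokens.
        shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```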

Model HumanEval HumanEval+ MBPP MBPP+ Average Average+
Mistral-7B 28.7 23.8 51.9 42.1 40.3 33.0
Gemma-7B 26.8 20.1 52.6 43.4 39.7 31.8
Llama-3-8B 33.5 29.3 61.4 51.6 47.5 40.5
Gemma-1.1-7B-Instruct 42.7 35.4 57.1 45.0 49.9 40.2
Mistral-7B-Instruct-v0.2 75.0 70.1 44.7 37.0 59.9 53.6
Llama-3-8B-Instruct 61.6 56.7 70.1 59.3 65.9 58.0
Mixtral-8×7B-Instruct-v0.1 45.1 39.6 59.5 49.7 52.3 44.7
MAmmoTH2-7B-Plus 72.1 65.9 60.1 50.4 66.1 58.2
MAmmoTH2-8B-Plus 63.4 57.9 60.4 48.6 61.9 53.3
MAmmoTH2-8x7B-Plus 57.9 53.7 68.7 56.9 63.3 55.3
Table 3: Code generation results of different models. Baseline results are copied from the EvalPlus leaderboard.

Model MT-Bench AlpacaEval 2.0 Arena Hard MMLU
GPT-4-1106-preview 9.32 50.0 - -
GPT-3.5-Turbo-1106 8.32 19.3 18.9 -
GPT-3.5-Turbo-0301 7.94 18.1 18.1 70.0
Tulu-2-DPO-70B 7.89 21.2 15.0 67.8
Llama-2-70b-chat 6.86 14.7 11.6 63.0
Yi-34B-Chat 7.86 27.2 23.1 73.5
Gemma-1.1-7B-Instruct - 10.4 7.5 64.3
Mistral-7B-Instruct-v0.2 7.60 17.1 12.6 60.8
Llama-3-8B-Instruct 8.02 22.9 20.6 67.2
Mixtral-8×7B-Instruct-v0.1 8.30 23.7 23.4 70.6
MAmmoTH2-7B-Plus 7.88 23.4 14.6 63.3
MAmmoTH2-8B-Plus 7.95 18.5 16.6 64.6
MAmmoTH2-8x7B-Plus 8.20 33.8 32.6 68.3
Table 4: Evaluation of instruction-following and MMLU performance for various models. Baseline scores are sourced from the original papers or the MT-Bench, AlpacaEval 2.0, and Arena Hard leaderboards. (“-”) indicates that the score was not available from the referenced sources. MAmmoTH2-Plus exhibits strong general conversational ability and excels at multitask language understanding across a wide range of domains compared to its official instruct counterparts and larger models.