🦣 MAmmoTH2:
Scaling Instructions from the Web

Carnegie Mellon University, University of Waterloo
xyue2@andrew.cmu.edu , wenhuchen@uwaterloo.ca

Abstract

Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involves (1) recalling relevant documents, (2) extracting instruction-response pairs, and (3) refining the extracted pairs using open-source LLMs. Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B’s (Mistral) performance increases from 11% to 34% on MATH and from 36% to 67% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. Our work demonstrates how to harvest large-scale, high-quality instruction data without costly human annotation or GPT-4 distillation, providing a new perspective on building better instruction tuning data.


Figure 1: Overview of MAmmoTH2-Plus results. The MAmmoTH2-8x7B-Plus variant outperforms Mixtral-Instruct on reasoning benchmarks, matching Qwen-1.5-110B with only 13% active parameters. It also surpasses Mixtral-Instruct by around 10 points on general code and chatbot benchmarks.

Introduction

Reasoning is crucial for problem-solving and advancing knowledge. While large language models (LLMs) have made strides in natural language processing (NLP), their ability to perform complex reasoning tasks remains limited. Efforts to enhance LLM reasoning include continued training on filtered documents and instruction tuning with a supervised fine-tuning loss. However, existing instruction tuning datasets are often limited in scale and biased in coverage, prompting the need for scalable and diverse instruction data.

To address this, we propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning various domains like math and science. Our three-step pipeline involves recalling documents from Common Crawl, extracting Q-A pairs, and refining them for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets.

To validate our dataset's effectiveness, we fine-tuned several base models on it; the resulting models significantly outperform their base counterparts on seven reasoning benchmarks. Further tuning on open-source instruction datasets enhances performance further, showcasing strong generalization. Our approach is also far more cost-effective than human-annotated datasets, offering promising directions for future instruction tuning studies.

Figure 2: Comparison between our dataset curation method and previous studies.

WebInstruct

In this section, we outline the process of constructing WebInstruct from the web corpus. Specifically, we divide the data collection pipeline into three stages: (1) high-quality data recall from the web corpus, (2) Q-A pair extraction, and (3) Q-A pair refinement. We depict the full pipeline in Figure 3 and provide an example of extraction and refinement in Figure 4.


Figure 3: Step 1: Recall relevant documents from Common Crawl. Step 2: Extract Q-A pairs. Step 3: Refine the extracted Q-A pairs.

Recall from Common Crawl

To ensure diversity in our training data across disciplines like math, science, and engineering, we crawl exam problems from educational websites such as stemez.com, homeworkstudy.com, and khanacademy.org as seed data. We collected 100K diverse seed examples and randomly sampled 100K negative documents from Common Crawl (CC) to train a fastText model. Using the open-source fastText library, we trained the model with 256-dimensional vectors for 3 epochs with a learning rate of 0.1, word n-grams up to length 3, and a minimum word occurrence of 3.
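
These hyperparameters map directly onto the supervised training API of the open-source fastText library. Below is a minimal sketch, assuming the seed and negative documents have been written to a fastText-formatted training file; the file names, label scheme, and example query are illustrative, not from the paper.

```python
# Minimal sketch of the stage-1 fastText recall model.
import fasttext

# train.txt: one document per line, prefixed with __label__positive (seed exam
# problems from educational sites) or __label__negative (random CC documents).
model = fasttext.train_supervised(
    input="train.txt",
    dim=256,        # 256-dimensional vectors
    epoch=3,        # 3 training epochs
    lr=0.1,         # learning rate 0.1
    wordNgrams=3,   # word n-grams up to length 3
    minCount=3,     # ignore words occurring fewer than 3 times
)
model.save_model("recall_stage1.bin")

# Score a candidate Common Crawl document (single line, no newlines).
labels, probs = model.predict("Solve for x: 2x + 3 = 11 ...", k=1)
print(labels[0], probs[0])
```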

In the initial stage, the trained fastText model recalls the top 100B documents from CC, categorizing them by domain (root URL). We employ GPT-4 to identify domains likely to contain instructional content, achieving satisfactory results through in-context learning. Subsequently, we sample additional documents from these selected domains as positive examples and use documents from non-selected domains and the general CC as negative examples to refine the fastText classifier. The updated classifier then recalls the top 18M documents for further processing.
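
A rough sketch of this recall-and-screen loop is shown below, under our own assumptions: the stage-1 classifier from the previous snippet, a toy `cc_documents` list standing in for a Common Crawl stream, and an illustrative 0.5 score threshold. Only the domain ranking is shown; the GPT-4 in-context screening itself is omitted.

```python
# Score CC documents with the trained fastText classifier, group them by root
# URL, and rank domains so the top ones can be screened as instructional or not.
from collections import defaultdict
from urllib.parse import urlparse

import fasttext

model = fasttext.load_model("recall_stage1.bin")

# Toy stand-in for a stream of (url, text) pairs read from Common Crawl dumps.
cc_documents = [
    ("https://www.stemez.com/physics/q1", "A ball is thrown upward at 20 m/s. How high does it rise?"),
    ("https://example.com/blog/post", "Ten tips for better sleep and productivity."),
]

def score(text: str) -> float:
    """Probability that a document looks like instructional content."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return float(probs[0]) if labels[0] == "__label__positive" else 1.0 - float(probs[0])

domain_docs = defaultdict(list)  # root URL -> list of (score, url)
for url, text in cc_documents:
    domain_docs[urlparse(url).netloc].append((score(text), url))

# Domains with many high-scoring documents are candidates for in-context
# screening; documents from approved domains become new positives for the
# refined classifier.
ranked_domains = sorted(domain_docs.items(),
                        key=lambda kv: sum(s > 0.5 for s, _ in kv[1]),
                        reverse=True)
```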


Figure 5: The distribution of the top 25 URLs in our instruction dataset.

Q-A Pair Extraction

The recalled documents contain diverse content from forums, homework sites, quizzes, and exams. Despite noise such as ads and HTML markup, they contain valuable Q-A pairs. To extract useful content, we first parse the HTML to remove unrelated information. We then use Mixtral-8×7B to identify Q-A pairs, resulting in 5M candidates. However, many candidates lack explanations, so we refine them further in the next step. We also filter out web pages containing questions or answers from our evaluation benchmarks to avoid contamination.
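
The extraction step can be pictured as a two-stage function: strip the HTML down to visible text, then prompt the extractor LLM. The sketch below is our own illustration; the prompt wording, the character truncation, and the assumption that Mixtral-8×7B is served behind an OpenAI-compatible endpoint (e.g., via vLLM) are not from the paper.

```python
# Sketch of Q-A pair extraction: clean HTML, then ask the LLM for JSON pairs.
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

EXTRACTION_PROMPT = """Below is a web page. Extract every self-contained
question-answer pair you can find. Return them as JSON: a list of objects
with "question" and "answer" fields. If there are none, return [].

Web page:
{page_text}"""

def extract_qa_pairs(html: str) -> str:
    # Drop scripts, styles, and tags; keep only the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    page_text = soup.get_text(separator="\n", strip=True)
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user",
                   "content": EXTRACTION_PROMPT.format(page_text=page_text[:12000])}],
        temperature=0.0,
    )
    return response.choices[0].message.content  # JSON string of candidate pairs
```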

Q-A Pair Refinement

To further improve the extracted Q-A pair candidates, we refine them with LLMs. In this step, we prompt Mixtral-8×7B and Qwen-72B to reformat the extracted Q-A pairs. If an answer does not contain any explanation, we prompt the LLMs to complete the intermediate reasoning steps leading to the answer. We adopt two models to increase diversity. Eventually, we harvest 10M Q-A pairs as our final instruction-tuning dataset, WebInstruct.
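
A similarly hedged sketch of the refinement step is shown below, alternating between the two refiner models across examples; the model identifiers, routing heuristic, prompt, and temperature are illustrative assumptions layered on the paper's description.

```python
# Sketch of Q-A pair refinement: reformat each pair and fill in missing reasoning.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
REFINERS = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "Qwen/Qwen-72B-Chat"]

REFINE_PROMPT = """Reformat the following question and answer into clean,
self-contained text. If the answer states a result without explanation,
add the intermediate reasoning steps that lead to it. Do not change the
final answer.

Question: {question}
Answer: {answer}"""

def refine(question: str, answer: str, idx: int) -> str:
    # Alternate models across examples to increase stylistic diversity.
    model = REFINERS[idx % len(REFINERS)]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REFINE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```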


Figure 4: An illustrative example from WebInstruct of the extraction and refinement steps.

Dataset Statistics

Table 1 compares WebInstruct with existing datasets. Most SFT datasets contain fewer than 1M samples of very high quality. XwinMath and OpenMathInstruct are the largest, surpassing 1M samples through GPT-4/Mixtral synthesis; however, they suffer from narrow domain coverage, being based mainly on GSM8K and MATH. This results in overfitting to these benchmarks, as shown in Table 2. On the other hand, CT datasets, often sourced from the web, are much larger, exceeding 10B tokens and even reaching 120B tokens, but they are costlier to train on and have a higher noise ratio. WebInstruct strikes a balance between scalability and quality, approaching the scale of CT datasets while maintaining high quality through its three-step construction pipeline. This sets it apart from the alternatives.

Dataset #Pairs Domain Format Dataset Source
FLAN V2 100K General SFT NLP data + Human CoT
Self-Instruct 82K General SFT Generated by GPT3
GPT4-Alpaca 52K General SFT Generated by GPT4
SuperNI 96K General SFT NLP Datasets
ToRA 16K Math SFT GPT4 GSM+MATH Synthesis
WizardMath 96K Math SFT GPT4 GSM+MATH Synthesis
MathInstruct 262K Math SFT GPT4 Math datasets Synthesis
MetaMathQA 395K Math SFT GPT-3.5-Turbo GSM+MATH Synthesis
XwinMath 1.4M Math SFT GPT4 GSM+MATH Synthesis
OpenMathInstruct 1.8M Math SFT Mixtral GSM+MATH Synthesis
Dataset #Tokens Domain Format Dataset Source
OpenWebMath 12B Math LM Filtered from Web
MathPile 10B Math LM Filtered from Web
Cosmopedia 25B General LM Synthesized by Mixtral
MINERVA 38B Math LM Filtered from Web
Proof-Pile-2 55B Math LM OWM+Arxiv+Code
Galactica 106B Math & Sci. LM Filtered from Web
DeepseekMath 120B Math LM Recalled from Web
WebInstruct (10M) 5B Math & Sci. SFT Recalled and Extracted from Web
Table 1: The list of existing supervised fine-tuning (SFT) and continued-training (CT) datasets. The SFT datasets are mostly derived from NLP datasets or completely synthesized by GPT-4. The CT datasets are much larger because they are filtered or recalled from the web, but their content contains a lot of noise. WebInstruct is the first dataset to combine the two approaches to build a high-quality yet large-scale SFT dataset.

Model TheoremQA MATH GSM8K GPQA MMLU-ST BBH ARC-C Avg
GPT-4-Turbo-0409 48.4 69.2 94.5 46.2 76.5 86.7 93.6 73.6
Parameter Size between 20B and 110B
Qwen-1.5-110B 34.9 49.6 85.4 35.9 73.4 74.8 91.6 63.6
Qwen-1.5-72B 29.3 46.8 77.6 36.3 68.5 68.0 92.2 59.8
Deepseek-LM-67B 25.3 15.9 66.5 31.8 57.4 71.7 86.8 50.7
Yi-34B 23.2 15.9 67.9 29.7 62.6 66.4 89.5 50.7
Llemma-34B 21.1 25.0 71.9 29.2 54.7 48.4 69.5 45.7
Mixtral-8×7B 23.2 28.4 74.4 29.7 59.7 66.8 84.7 52.4
Mixtral-8×7B-Instruct 25.3 22.1 71.7 32.4 61.4 57.3 84.7 50.7
Intern-Math-20B 17.1 37.7 82.9 28.9 50.1 39.3 68.6 46.4
Trained only with WebInstruct (All evaluations are held-out)
MAmmoTH2-34B 30.4 35.0 75.6 31.8 64.5 68.0 90.0 56.4
MAmmoTH2-8x7B 32.2 39.0 75.4 36.8 67.4 71.1 87.5 58.9
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K)
MAmmoTH2-8x7B-Plus 34.1 47.0 86.4 37.8 72.4 74.1 88.4 62.9
Parameter Size = 7B or 8B
Deepseek-7B 15.7 6.4 17.4 25.7 43.1 42.8 47.8 28.4
Qwen-1.5-7B 14.2 13.3 54.1 26.7 45.4 45.2 75.6 39.2
Mistral-7B 19.2 11.2 36.2 24.7 50.1 55.7 74.2 38.8
Gemma-7B 21.5 24.3 46.4 25.7 53.3 57.4 72.5 43.0
Llemma-7B 17.2 18.0 36.4 23.2 45.2 44.9 50.5 33.6
WizardMath-7B-1.1 11.7 33.0 83.2 28.7 52.7 56.7 76.9 49.0
OpenMath-Mistral 13.1 9.1 24.5 26.5 43.7 49.5 69.4 33.7
Abel-7B-002 19.3 29.5 83.2 30.3 29.7 32.7 72.5 42.5
Intern-Math-7B 13.2 34.6 78.1 22.7 41.1 48.1 59.8 42.5
Rho-1-Math-7B 21.0 31.0 66.9 29.2 53.1 57.7 72.7 47.3
Deepseek-Math-7B 25.3 34.0 64.2 29.2 56.4 59.5 67.8 48.0
Deepseek-Math-Instruct 23.7 44.3 82.9 31.8 59.3 55.4 70.1 52.5
Llama-3-8B 20.1 21.3 54.8 27.2 55.6 61.1 78.6 45.5
Llama-3-8B-Instruct 22.8 30.0 79.5 34.5 60.2 66.0 80.8 53.4
Trained only with WebInstruct (All evaluations are held-out)
MAmmoTH2-7B 29.0 36.7 68.4 32.4 62.4 58.6 81.7 52.8
MAmmoTH2-8B 32.2 35.8 70.4 35.2 64.2 62.1 82.2 54.3
Continually trained with additional instruction datasets (All held-out except MATH and GSM8K)
MAmmoTH2-7B-Plus 29.2 45.0 84.7 36.8 64.5 63.1 83.0 58.0
MAmmoTH2-8B-Plus 32.5 42.8 84.1 37.3 65.7 67.8 83.4 59.1
Table 2: Our main results on various science reasoning datasets. All models without ‘-Instruct’ refer to the released base models before instruction tuning. For experimental results reported by the official paper or OpenCompass, we take the reported numbers; otherwise, we evaluate with our own script. Underlined results are the best baseline scores under the size constraint.

Figure 6: Mistral-7B performance improves as the number of instructions scales up. Additionally, SFT loss is a more effective training objective than LM loss.
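
To make the caption's distinction concrete, here is a minimal sketch (our interpretation, not the paper's training code) of how the two objectives differ: both are next-token cross-entropy, but SFT loss masks out the instruction tokens so that only the response contributes to the gradient, whereas LM loss is computed over every token.

```python
# Sketch of SFT loss vs. LM loss for a single (instruction, response) sequence.
import torch
import torch.nn.functional as F

def token_loss(logits: torch.Tensor, input_ids: torch.Tensor,
               prompt_len: int, sft: bool = True) -> torch.Tensor:
    """logits: (seq, vocab); input_ids: (seq,); prompt_len: #tokens in the question."""
    # Standard next-token shift: position t predicts token t+1.
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    if sft:
        # Mask positions whose target token still belongs to the prompt,
        # so the loss covers only the response tokens.
        shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```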

Model HumanEval HumanEval+ MBPP MBPP+ Average Average+
Mistral-7B 28.7 23.8 51.9 42.1 40.3 33.0
Gemma-7B 26.8 20.1 52.6 43.4 39.7 31.8
Llama-3-8B 33.5 29.3 61.4 51.6 47.5 40.5
Gemma-1.1-7B-Instruct 42.7 35.4 57.1 45.0 49.9 40.2
Mistral-7B-Instruct-v0.2 75.0 70.1 44.7 37.0 59.9 53.6
Llama-3-8B-Instruct 61.6 56.7 70.1 59.3 65.9 58.0
Mixtral-8×7B-Instruct-v0.1 45.1 39.6 59.5 49.7 52.3 44.7
MAmmoTH2-7B-Plus 72.1 65.9 60.1 50.4 66.1 58.2
MAmmoTH2-8B-Plus 63.4 57.9 60.4 48.6 61.9 53.3
MAmmoTH2-8x7B-Plus 57.9 53.7 68.7 56.9 63.3 55.3
Table 3: Code generation results of different models. Baseline results are copied from the EvalPlus leaderboard.

Model MT-Bench AlpacaEval 2.0 Arena Hard MMLU
GPT-4-1106-preview 9.32 50.0 - -
GPT-3.5-Turbo-1106 8.32 19.3 18.9 -
GPT-3.5-Turbo-0301 7.94 18.1 18.1 70.0
Tulu-2-DPO-70B 7.89 21.2 15.0 67.8
Llama-2-70b-chat 6.86 14.7 11.6 63.0
Yi-34B-Chat 7.86 27.2 23.1 73.5
Gemma-1.1-7B-Instruct - 10.4 7.5 64.3
Mistral-7B-Instruct-v0.2 7.60 17.1 12.6 60.8
Llama-3-8B-Instruct 8.02 22.9 20.6 67.2
Mixtral-8×7B-Instruct-v0.1 8.30 23.7 23.4 70.6
MAmmoTH2-7B-Plus 7.88 23.4 14.6 63.3
MAmmoTH2-8B-Plus 7.95 18.5 16.6 64.6
MAmmoTH2-8x7B-Plus 8.20 33.8 32.6 68.3
Table 4: Evaluation of instruction-following and MMLU performance for various models. Baseline scores are sourced from the original papers or the MT-Bench, AlpacaEval 2.0, and Arena Hard leaderboards. (“-”) indicates that the score was not available from the referenced sources. MAmmoTH2-Plus exhibits strong general conversational ability and excels at multitask language understanding across a wide range of domains compared to its official instruct counterparts and larger models.