General-Reasoner: Advancing LLM Reasoning Across All Domains

♥*Xueguang Ma, ♦*Qian Liu, ♥♠Dongfu Jiang, ♥♣Ge Zhang, Zejun Ma, ♥♠Wenhu Chen
University of Waterloo, Vector Institute, TikTok, Singapore, M-A-P
x93ma@uwaterloo.ca, wenhuchen@uwaterloo.ca

Abstract

Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). Particularly, the "Zero" reinforcement learning introduced by DeepSeek-R1-Zero enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current work on LLM reasoning mainly focuses on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations and data is more scarce. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with chain-of-thought, context-aware judgment. We train a series of models and evaluate them on a wide range of datasets covering domains like physics, chemistry, finance, and electronics. Our comprehensive evaluation across 12 benchmarks (e.g., MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH, MATH, and AMC) demonstrates that General-Reasoner outperforms existing baselines, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.


Figure 1: Effectiveness of our General-Reasoner, trained on diverse verifiable reasoning questions with a model-based verifier, compared to baseline methods across various reasoning tasks.

General-Reasoner Training Paradigm

We propose a new training paradigm to broaden LLM reasoning beyond mathematics, consisting of two key components:

  • WebInstruct-verified: A diverse, large-scale dataset containing 230K high-quality reasoning questions across domains such as physics, chemistry, finance, and the social sciences, curated and filtered for verifiability using LLMs.
  • General-Verifier: A 1.5B parameter generative verifier model that enables context-aware, chain-of-thought based answer verification for a wide range of answer types, improving reward reliability during RL training.

Diverse Verifiable Reasoning Tasks

To facilitate robust reasoning capabilities across a wide range of domains beyond mathematical problems, we construct a large-scale, diverse, and high-quality dataset composed of verifiable reasoning tasks. Our dataset-building pipeline is illustrated below.

Our initial data source is WebInstruct, which contains around 5 million naturally occurring, web-crawled instructions from sites such as StackExchange and educational portals. While useful for general instruction tuning, most entries lack verifiable answers or a clear reasoning structure.

We trace entries back to their original web pages to extract question-answer pairs. Questions without clearly written human answers are discarded to ensure quality. Gemini-1.5-Pro is then used to identify verifiable questions with concise answers, yielding 1M candidates. Gemini-2.0-Flash annotates metadata like answer type and difficulty. We downsample easy math entries to maintain balance.
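
To make this curation flow concrete, the minimal sketch below mirrors the steps above. It is illustrative only: call_gemini is a hypothetical stand-in for the actual Gemini-1.5-Pro / Gemini-2.0-Flash API calls, and the prompts, field names, and the 20% keep rate for easy math entries are assumptions rather than the paper's exact settings.

import json
import random

def call_gemini(model: str, prompt: str) -> str:
    """Hypothetical LLM wrapper; replace with a real Gemini API client."""
    raise NotImplementedError

def curate(raw_entries):
    """Filter traced WebInstruct entries down to verifiable QA candidates."""
    curated = []
    for entry in raw_entries:
        # Discard questions without a clearly written human answer.
        if not entry.get("human_answer"):
            continue
        # Ask an LLM whether the question has a single, concise, verifiable answer.
        verdict = call_gemini(
            "gemini-1.5-pro",
            "Does this question have a concise, verifiable answer? Reply yes or no.\n"
            + entry["question"],
        )
        if not verdict.strip().lower().startswith("yes"):
            continue
        # Annotate metadata such as answer type, subject, and difficulty.
        meta = json.loads(call_gemini(
            "gemini-2.0-flash",
            "Return JSON with keys answer_type, subject, difficulty for:\n"
            + entry["question"],
        ))
        entry = {**entry, **meta}
        # Downsample easy math entries to keep the subject/difficulty balance
        # (the 20% keep rate here is an illustrative choice).
        if meta.get("subject") == "math" and meta.get("difficulty") == "easy":
            if random.random() > 0.2:
                continue
        curated.append(entry)
    return curated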

We further apply quality filtering by generating 8 candidate answers per question with Gemini-2.0-Flash (a minimal filtering sketch follows this list):

  • Remove questions where all 8 answers fail (ambiguous or noisy).
  • Remove trivial questions where all 8 are correct (low difficulty).
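
The sketch below captures this filter; sample_answers and is_correct are hypothetical stand-ins for Gemini-2.0-Flash answer sampling and answer checking, and the two thresholds follow the rules above.

N_SAMPLES = 8

def sample_answers(question: str, n: int = N_SAMPLES) -> list[str]:
    """Hypothetical helper: sample n candidate answers with Gemini-2.0-Flash."""
    raise NotImplementedError

def is_correct(candidate: str, reference: str) -> bool:
    """Hypothetical helper: check a candidate against the reference answer."""
    raise NotImplementedError

def keep_question(question: str, reference: str) -> bool:
    candidates = sample_answers(question)
    n_correct = sum(is_correct(c, reference) for c in candidates)
    if n_correct == 0:          # all 8 fail -> likely ambiguous or noisy
        return False
    if n_correct == N_SAMPLES:  # all 8 correct -> too easy to drive RL learning
        return False
    return True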

The finalized examples also serve as training data for our model-based verifier. The resulting dataset contains roughly 230K reasoning problems spanning varied answer formats and topics.
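
For illustration, a curated record might look like the following; the field names and values are hypothetical and do not necessarily match the released dataset schema.

example_record = {
    "question": "A 2.0 kg block slides from rest down a frictionless 30-degree "
                "incline that is 5.0 m long. What is its speed at the bottom?",
    "answer": "7.0 m/s",
    "answer_type": "Float with unit",
    "subject": "Physics",
    "difficulty": "medium",
}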


Figure 2: Data creation pipeline: It consists of QA mining, extraction, and verification.


Figure 3: Answer Type Distribution


Figure 4: Domain Distribution

The dataset covers multiple fields such as mathematics, physics, chemistry, finance, and the humanities. This rigorous process ensures the dataset is reliable, verifiable, and diverse, enabling the training of generalizable reasoning LLMs.

Generative Model-Based Verifier

Traditional rule-based verifiers, commonly used for mathematical reasoning, rely on rigid matching or symbolic comparison to determine correctness. While efficient for math tasks, they face major limitations in broader reasoning (a concrete illustration follows this list):

  • Rigid Matching Criteria: They struggle to recognize semantically equivalent answers expressed differently.
  • Semantic Insensitivity: They cannot interpret varied but valid formats (e.g., different units or synonyms).
  • Lack of Generality: Adapting them to diverse domains like finance, chemistry, or humanities is impractical.
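
The toy example below makes this concrete: all four student answers are equivalent to the reference, yet a rigid exact-match rule rejects every one of them. The snippet is illustrative only and is not part of the paper's verification code.

reference = "0.5 mol"
student_answers = ["1/2 mol", "0.50 moles", "5.0e-1 mol", "half a mole"]

def rule_based_match(prediction: str, ref: str) -> bool:
    # Rigid matching: normalize whitespace and case only, then compare strings.
    return prediction.strip().lower() == ref.strip().lower()

print([rule_based_match(a, reference) for a in student_answers])
# -> [False, False, False, False]; all four are semantically correct.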

To overcome these challenges, we develop a compact generative model-based verifier trained to determine answer equivalence in a chain-of-thought format. Rather than relying on expensive LLMs like Gemini-2.0, our verifier is a 1.5B-parameter model fine-tuned from Qwen2.5-Math-1.5B using our curated dataset.

The verifier takes in the question, reference answer, and student-generated answer, and generates a reasoning trace followed by a binary true/false verdict. This provides accurate, interpretable reward signals for reinforcement learning.
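
A minimal sketch of how such a verifier can be queried during RL reward computation is shown below, assuming the released checkpoint loads with Hugging Face transformers. The model id, prompt template, and verdict parsing are illustrative assumptions rather than the exact format used to train General-Verifier.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TIGER-Lab/general-verifier"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def verify(question: str, reference: str, student: str) -> bool:
    # Illustrative prompt: question + reference + student answer, asking for a
    # chain-of-thought followed by an explicit True/False verdict.
    prompt = (
        f"Question: {question}\n"
        f"Reference Answer: {reference}\n"
        f"Student Answer: {student}\n"
        "Reason step by step, then end with 'Final Decision: True' or "
        "'Final Decision: False'.\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=512)
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return "true" in completion.lower().rsplit("final decision:", 1)[-1]

# The boolean verdict can serve directly as a 0/1 reward signal during RL.
reward = 1.0 if verify("Express 3/6 as a decimal.", "0.5", "one half") else 0.0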

Our model-based verifier demonstrates strong alignment with Gemini-2.0-Flash and significantly outperforms traditional rule-based methods in robustness and generality.


Figure 5: Comparison between traditional rule-based and our generative model-based verifier across domains.

Results on Reasoning Benchmarks

We evaluate General-Reasoner across a variety of reasoning tasks. Trained with Zero RL, our models consistently outperform both the base and instruction-tuned models built on the Qwen2.5 and Qwen3 backbones.

For example, with Qwen2.5-7B-Base, General-Reasoner achieves 58.9% on MMLU-Pro, surpassing both the base model (47.7%) and the instruction-tuned model (57.0%). With the Qwen2.5-14B backbone, performance further improves, reaching 66.6% on MMLU-Pro.

Our approach also demonstrates strong results on math-related tasks at both the 7B and 14B scales. Compared to reinforcement learning methods such as SimpleRL and Nemotron-CrossThink, General-Reasoner leads on benchmarks including GPQA, SuperGPQA, and BBEH.

The Qwen3 backbone provides additional improvements: General-Reasoner-4B achieves 62.8% on MMLU-Pro, outperforming the 7B Qwen2.5 variant. Our best model, General-Reasoner-Qwen3-14B, reaches 56.1% on GPQA and 54.4% on TheoremQA, comparable to closed-source models like GPT-4o.

While a gap remains on some benchmarks versus commercial systems, our results highlight the promise of Zero RL when paired with diverse reasoning data and compact model-based reward verifiers.


Table 1: Accuracy comparison of our General-Reasoner with baseline methods on general reasoning benchmarks.


Table 2: Accuracy comparison across math-related benchmarks.

Citation

Please cite our work as follows if you find it helpful:

@article{general-reasoner,
    title={{G}eneral-{R}easoner: Advancing {LLM} Reasoning Across All Domains}, 
    author={Xueguang Ma and Qian Liu and Dongfu Jiang and Ge Zhang and Zejun Ma and Wenhu Chen},
    year={2025},
    journal={arXiv:2505.14652},
    url={https://arxiv.org/abs/2505.14652}, 
}