Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning

Haozhe Wang♠♣♥, Qixin Xu✦♣, Che Liu, Junhong Wu,
Fangzhen Lin, Wenhu Chen


The Hong Kong University of Science and Technology, University of Waterloo, M-A-P,
Tsinghua University, Imperial College London, UCAS


Corresponding to: jasper.whz@outlook.com, wenhu.chen@uwaterloo.ca

🔔 News

🔥[2025-09-09] The paper is out 🚀. We are now working on releasing code and models.

How Do LLMs *Really* Learn to Reason?

Reinforcement Learning (RL) has been a game-changer for teaching LLMs complex reasoning, but how it works has been a mystery. Puzzling behaviors like sudden "aha moments" and performance boosts from longer answers ("length-scaling") have been observed but not understood.

In this work, we reveal that these are not random quirks. They are the hallmarks of an emergent reasoning hierarchy, where the model learns to reason much like a human: by separating high-level strategic planning from low-level procedural execution. We show this process unfolds in two overlapping phases and leverage this insight to create a more efficient RL algorithm.

Diagram illustrating the separation of high-level planning and low-level execution in reasoning.

A Two-Phase Dynamic Behind Enhanced Reasoning through RL

Our analysis reveals that RL-trained LLMs don't improve monolithically. Instead, they follow a two-phase learning dynamic where the "bottleneck" to better performance shifts over time.

Phase 1: Forging a Reliable Procedural Engine. Initially, the model focuses on mastering the basics: it learns to reliably perform low-level steps such as formatting, arithmetic, and variable substitution. We observe this as a sharp drop in uncertainty (perplexity and token entropy) for these "execution tokens."
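To make these measurements concrete, here is a minimal sketch of how per-token entropy and perplexity can be computed from a model's logits. The function name and tensor shapes are illustrative assumptions, not taken from our released code.

```python
import torch
import torch.nn.functional as F

def entropy_and_perplexity(logits, target_ids, token_mask):
    """Mean token entropy and perplexity over a chosen subset of tokens.

    logits:     [seq_len, vocab] pre-softmax scores at each position
    target_ids: [seq_len] the tokens actually generated
    token_mask: [seq_len] bool, True for the tokens of interest
                (e.g. execution tokens in Phase 1, planning tokens later)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy of the next-token distribution at each position.
    entropy = -(probs * log_probs).sum(dim=-1)                      # [seq_len]
    # Negative log-likelihood of the tokens the model actually produced.
    nll = -log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return entropy[token_mask].mean().item(), nll[token_mask].mean().exp().item()
```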

Graphs showing training dynamics: token entropy of execution tokens decreases while semantic entropy of planning tokens increases.

We track the training dynamics of representative model families. The curves reveal a two-phase dynamic. As the first two columns show, the model initially focuses on procedural consolidation, marked by a sharp decrease in perplexity (greater confidence) and token entropy (greater certainty) over execution tokens. This is followed by a shift toward exploring strategic planning, evident in the third column: the diversity of strategic plans (semantic entropy) increases steadily on Qwen models, or turns upward on Llama, correlating with consistently improved accuracy and longer reasoning chains (fourth column).

For strong models or easy-to-learn data, this phase can be brief or even absent: the model already possesses reliable mastery of foundational low-level skills and often requires only minimal adjustment of formatting tokens.

Phase 2: Mastering High-Level Strategic Planning. Once the model has laid a solid foundation of low-level skills, the learning frontier shifts. Performance gains are now driven by exploring and mastering high-level strategies, such as choosing a new approach, backtracking, or identifying a key theorem.

We confirm this by measuring the semantic entropy of the model's planning tokens, which captures the diversity of its high-level strategic plans. The semantic entropy of planning tokens (red line, third column) increases steadily, either from the start of training or from a turning point. This rise parallels increasing reasoning accuracy and length scaling, suggesting that the policy is actively expanding its repertoire of strategic plans and thereby sustaining its improvement in reasoning. This contrasts sharply with the steep decrease in token-level entropy seen during the initial procedural-consolidation phase.
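To make "semantic entropy" concrete: sample several rollouts for the same prompt, extract their planning statements, cluster them by meaning, and take the entropy of the cluster distribution. The greedy clustering and the toy same_meaning test below are illustrative stand-ins for whichever semantic-equivalence check one prefers (an NLI model, embedding similarity, etc.); they are not our exact pipeline.

```python
from collections import Counter
from math import log

def semantic_entropy(plans, same_meaning):
    """Entropy over clusters of semantically equivalent plans.

    plans:        list of strings, e.g. the planning sentences extracted
                  from K sampled rollouts for the same prompt
    same_meaning: callable (str, str) -> bool deciding semantic equivalence
    """
    clusters = []                      # one representative string per cluster
    labels = []
    for p in plans:
        for idx, rep in enumerate(clusters):
            if same_meaning(p, rep):
                labels.append(idx)
                break
        else:
            clusters.append(p)
            labels.append(len(clusters) - 1)
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

# Toy check: 3 distinct strategies among 4 rollouts -> about 1.04 nats.
print(semantic_entropy(
    ["induct on n", "proof by induction", "use AM-GM", "try a substitution"],
    same_meaning=lambda a, b: "induct" in a and "induct" in b))
```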

Explaining Puzzling Phenomena

Our hierarchical framework provides a unified explanation for previously mysterious phenomena observed during RL training:

  • "Aha Moments": These aren't random flashes of brilliance. An "aha moment" is the behavioral signature of the model discovering, mastering, and reinforcing a new, powerful high-level strategy, like self-reflection.
  • "Length-Scaling": Performance improving with longer outputs is a direct result of better planning. As a model explores more diverse and sophisticated strategies—involving case analysis, planning, and backtracking—it naturally produces longer, more structured, and more successful reasoning traces.
  • Complex Entropy Dynamics: The often-confusing trend of overall token-level entropy is demystified. It decreases because the vast majority of tokens are low-level *execution* tokens that become predictable with training. This masks the real story: the increasing *semantic* entropy of high-level *planning* tokens, which accurately tracks the model's exploration of new strategies (the toy calculation after this list makes the masking effect concrete).
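A toy calculation of the masking effect mentioned above; the numbers are made up purely for illustration.

```python
# Made-up numbers: 95% of tokens are low-level execution tokens whose
# entropy collapses during Phase 1, 5% are planning tokens whose
# uncertainty rises as the policy explores new strategies.
frac_exec, frac_plan = 0.95, 0.05
before = frac_exec * 1.5 + frac_plan * 1.5    # 1.500 nats
after  = frac_exec * 0.2 + frac_plan * 2.5    # 0.315 nats
print(f"average token entropy: {before:.3f} -> {after:.3f}")
# The global average falls sharply even though uncertainty on the
# planning tokens (the exploration that matters) has gone up.
```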

Hierarchy-Aware Credit Assignment: Focusing on What Matters

This insight exposes a core inefficiency in current RL methods like GRPO: they apply optimization pressure agnostically to all tokens, diluting the learning signal. If the key to advanced reasoning is mastering strategy, why waste effort on already-learned procedural steps?

We introduce HICRA (Hierarchy-Aware Credit Assignment), an algorithm that concentrates optimization pressure directly on the high-impact planning tokens. By amplifying the learning signal for strategic moves, HICRA accelerates the discovery and reinforcement of effective reasoning patterns.
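As a rough sketch of the idea, the snippet below layers hierarchy-aware weighting on top of a GRPO-style group-normalized advantage. The function name, the (1 + alpha) amplification form, and the default alpha are illustrative assumptions here, not the paper's exact formulation or tuned hyperparameters.

```python
import torch

def hicra_advantages(rewards, planning_mask, alpha=0.2):
    """Sketch: hierarchy-aware credit assignment on top of GRPO.

    rewards:       [G] scalar reward for each rollout in a group of G samples
    planning_mask: [G, T] bool, True where a token was tagged as a planning
                   token (the tagging step itself is not shown here)
    alpha:         how much extra credit planning tokens receive
                   (an illustrative value, not the paper's tuned setting)
    """
    # GRPO-style group-normalized advantage, broadcast to every token.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)       # [G]
    adv = adv.unsqueeze(-1).expand(planning_mask.shape)             # [G, T]
    # Concentrate optimization pressure on planning tokens; execution
    # tokens keep the plain GRPO signal.
    return torch.where(planning_mask, (1.0 + alpha) * adv, adv)
```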

Results: Targeted Exploration Wins

HICRA consistently outperforms strong GRPO baselines across multiple text-only and vision-language models.

Table showing HICRA outperforming GRPO and Base models on math reasoning benchmarks.

Training Dynamics of Error Types

Our analysis shows that RL's primary benefit comes from correcting high-level strategic faults, not minor calculation errors. HICRA's focused approach is simply more efficient at this.

Identifying Planning Tokens: Semantic Function or High Entropy?

Other work has proposed using high-entropy "fork tokens" as a proxy for decision points in a model's reasoning process. We investigated the relationship between these entropy-based tokens and our functionally-defined planning tokens.

We found a crucial asymmetry: while most of our planning tokens do exhibit high entropy (as expected for strategic choices), the reverse is not true. Most high-entropy tokens are not planning tokens; they often correspond to simple variations in phrasing or low-level calculations that do not change the overall strategy. This highlights the limitation of using entropy alone to identify tokens with a specific semantic function.

Graphs showing that most planning tokens are high-entropy, but most high-entropy tokens are not planning tokens.

While a majority of functionally-defined planning tokens are high-entropy (left), high-entropy tokens are not a good proxy for planning tokens, as most of them serve other functions (right).
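For intuition, here is a hedged sketch of the two identification routes being compared. The cue lexicon is a small illustrative stand-in for a full semantic-function tagger, and the entropy threshold is arbitrary; neither is the paper's exact procedure.

```python
import re

# Illustrative (not exhaustive) cue lexicon for high-level strategic moves:
# re-planning, backtracking, case analysis, invoking a theorem, reflection.
PLANNING_CUES = re.compile(
    r"(alternatively|instead|let me try|wait,|step back|"
    r"consider the cases|by the .* theorem|another approach)",
    re.IGNORECASE,
)

def is_planning_span(span_text: str) -> bool:
    """Semantic-function route: does this span express a strategic move?"""
    return PLANNING_CUES.search(span_text) is not None

def is_fork_token(token_entropy: float, threshold: float = 1.0) -> bool:
    """Entropy route: flag any sufficiently high-entropy token.

    The threshold here is arbitrary; entropy-based work typically derives
    it from the empirical entropy distribution.
    """
    return token_entropy > threshold
```

The asymmetry in the figure is exactly what these two routes would disagree on: spans flagged by the semantic route usually contain high-entropy tokens, but most tokens flagged by the entropy route sit inside ordinary phrasing or arithmetic.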

What metric is a good compass for tracking exploration?

Measuring strategic exploration accurately is crucial for diagnosing policy learning. However, we find that common metrics like token-level entropy can be misleading.

  • Token-Level Entropy's Flaw: This metric often converges to a low level, which practitioners interpret as entropy "collapse." That reading is misleading. Token-level entropy is dominated by the vast number of low-level execution tokens, which are destined to become predictable (low entropy); their collapse pulls the global average down, but it does not mean exploration has ceased. As long as semantic entropy remains high, the model is still actively exploring new high-level strategies and performance continues to improve.
  • Pass@K's Blind Spot: This metric, which measures the success rate over K attempts, can saturate (e.g., once every query is solved by at least one of the K attempts), making it useless for distinguishing between methods or tracking learning dynamics later in training; see the estimator sketch below.
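For reference, the standard unbiased Pass@K estimator (given n samples per problem, c of which are correct) shows why the metric saturates. This is background on the metric itself, not a contribution of this work.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate for a single problem.

    n: number of sampled attempts, c: number of correct attempts, k <= n.
    Standard estimator: 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Once most problems have at least one correct sample among the n attempts,
# Pass@K sits near 1 for every method and stops discriminating.
print(pass_at_k(n=16, c=1, k=8))   # 0.5 even with a single correct sample
```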

Semantic Entropy avoids these pitfalls. It directly measures the diversity of meaningful strategic plans. As shown below, semantic entropy remains a powerful differentiator, revealing HICRA’s continued strategic exploration even when token entropy has collapsed and Pass@8 has saturated. This makes it a far more reliable compass for tracking true reasoning development.

Graphs showing token entropy collapsing and Pass@8 saturating, while semantic entropy continues to differentiate HICRA from GRPO.

Token entropy (far right) collapses and Pass@8 (second from right) saturates, becoming useless. In contrast, Semantic Entropy (far left) clearly shows HICRA's sustained exploration advantage, which correlates with better final accuracy.

Reference

If you find our work useful, please give us a free cite:

@article{hicra,
    title={Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning},
    author={Wang, Haozhe and Xu, Qixin and Liu, Che and Wu, Junhong and Lin, Fangzhen and Chen, Wenhu},
    journal={arXiv preprint arXiv:2509.03646},
    year={2025}
}