🔥[2025-09-09] The paper is out 🚀. We are now working on releasing code and models.
Reinforcement Learning (RL) has been a game-changer for teaching LLMs complex reasoning, but how it works has been a mystery. Puzzling behaviors like sudden "aha moments" and performance boosts from longer answers ("length-scaling") have been observed but not understood.
In this work, we reveal that these are not random quirks. They are the hallmarks of an emergent reasoning hierarchy, where the model learns to reason much like a human: by separating high-level strategic planning from low-level procedural execution. We show this process unfolds in two overlapping phases and leverage this insight to create a more efficient RL algorithm.
Our analysis reveals that RL-trained LLMs don't improve monolithically. Instead, they follow a two-phase learning dynamic where the "bottleneck" to better performance shifts over time.
Phase 1: Forging a Reliable Procedural Engine. Initially, the model focuses on mastering the basics. It learns to reliably perform low-level steps such as formatting, arithmetic, and variable substitution. We observe this as a sharp drop in uncertainty (perplexity and token entropy) for these "execution tokens."
We track the training dynamics of representative model families. The curves reveal a two-phase dynamic. The first two columns show an initial focus on procedural consolidation, marked by a sharp decrease in the perplexity (greater confidence) and token entropy (greater certainty) of execution tokens. This is followed by a shift toward exploring strategic planning, evident in the third column: the diversity of strategic plans (semantic entropy) increases steadily on Qwen models or turns upward on Llama, correlating with consistently improved accuracy and longer reasoning chains (fourth column).
For strong models or with easy-to-learn data, this phase can be brief or even absent, as the model already possesses reliable mastery of foundational low-level skills and often requires only minimal adjustment of formatting tokens.
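For concreteness, the two Phase 1 signals (token entropy and perplexity of execution tokens) can be computed roughly as follows. This is a minimal PyTorch sketch, assuming per-response logits and a precomputed execution-token mask; the function and variable names are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def execution_token_stats(logits, labels, execution_mask):
    """Mean token entropy and perplexity restricted to execution tokens.

    logits:         [T, V] model outputs for one response
    labels:         [T]    token ids actually generated
    execution_mask: [T]    bool, True at execution tokens
                           (formatting, arithmetic, substitutions, ...)
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # [T, V]
    probs = log_probs.exp()

    # Token-level entropy: H_t = -sum_v p_t(v) log p_t(v)
    token_entropy = -(probs * log_probs).sum(dim=-1)               # [T]

    # Negative log-likelihood of the tokens the model actually produced
    nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [T]

    mask = execution_mask.float()
    mean_entropy = (token_entropy * mask).sum() / mask.sum()
    perplexity = torch.exp((nll * mask).sum() / mask.sum())
    return mean_entropy.item(), perplexity.item()
```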
Phase 2: Mastering High-Level Strategic Planning. Once the model has laid a solid foundation of low-level skills, the learning frontier shifts. Performance gains are now driven by exploring and mastering high-level strategies, such as choosing a new approach, backtracking, or identifying a key theorem.
We confirm this by measuring the semantic entropy of the model's planning tokens, which captures the diversity of the model's high-level strategic plans. The semantic entropy of planning tokens (red line, third column) increases steadily, either from the start of training or from a turning point. This rise parallels the gains in reasoning accuracy and the growth in response length, suggesting that the policy is actively expanding its repertoire of strategic plans to sustain improvement in reasoning. This stands in stark contrast to the sharp decrease in token-level entropy seen during the initial procedural consolidation phase.
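A minimal sketch of how semantic entropy can be estimated for a single prompt: sample several planning spans, cluster those that express the same strategy, and take the entropy over cluster frequencies. The `same_meaning` equivalence judge is an assumption here (e.g. an NLI model or an LLM judge); the paper's exact clustering procedure may differ.

```python
import math

def semantic_entropy(plans, same_meaning):
    """Entropy over semantic clusters of sampled strategic plans.

    plans:        list of planning spans sampled from the policy for one prompt
    same_meaning: callable (a, b) -> bool judging semantic equivalence
                  (left abstract here; an external judge in practice)
    """
    clusters = []  # each cluster holds semantically equivalent plans
    for plan in plans:
        for cluster in clusters:
            if same_meaning(plan, cluster[0]):
                cluster.append(plan)
                break
        else:
            clusters.append([plan])

    n = len(plans)
    # H = -sum_c p(c) log p(c), with p(c) estimated from cluster frequencies
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```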
Our hierarchical framework provides a unified explanation for previously mysterious phenomena observed during RL training, such as sudden "aha moments" and length-scaling.
This insight exposes a core inefficiency in current RL methods like GRPO: they apply optimization pressure agnostically to all tokens, diluting the learning signal. If the key to advanced reasoning is mastering strategy, why waste effort on already-learned procedural steps?
We introduce HICRA (Hierarchy-Aware Credit Assignment), an algorithm that concentrates optimization pressure directly on the high-impact planning tokens. By amplifying the learning signal for strategic moves, HICRA accelerates the discovery and reinforcement of effective reasoning patterns.
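The sketch below illustrates the core idea under simple assumptions: a GRPO-style group-relative advantage per sampled response, which is then amplified on planning tokens instead of being broadcast uniformly over all tokens. The `alpha` knob and the exact weighting are hypothetical placeholders, not HICRA's published update rule.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style group-relative advantage: one scalar per sampled response."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def hicra_token_advantages(response_adv, planning_mask, alpha=1.0):
    """Illustrative hierarchy-aware credit assignment for one response.

    response_adv:  scalar advantage for this response (from grpo_advantages)
    planning_mask: [T] bool array, True at planning tokens
    alpha:         amplification strength (hypothetical knob)

    GRPO would apply `response_adv` uniformly to all T tokens; here the
    signal on planning tokens is amplified so strategic moves receive a
    larger share of the optimization pressure.
    """
    planning_mask = np.asarray(planning_mask, dtype=bool)
    token_adv = np.full(planning_mask.shape, response_adv, dtype=np.float32)
    token_adv[planning_mask] *= (1.0 + alpha)
    return token_adv
```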
HICRA consistently outperforms strong GRPO baselines across multiple text-only and vision-language models.
Our analysis shows that RL's primary benefit comes from correcting high-level strategic faults, not minor calculation errors. HICRA's focused approach is simply more efficient at this.
Other work has proposed using high-entropy "fork tokens" as a proxy for decision points in a model's reasoning process. We investigated the relationship between these entropy-based tokens and our functionally-defined planning tokens.
We found a crucial asymmetry: while most of our planning tokens do exhibit high entropy (as expected for strategic choices), the reverse is not true. Most high-entropy tokens are not planning tokens; they often correspond to simple variations in phrasing or low-level calculations that do not change the overall strategy. This highlights the limitation of relying on entropy alone to identify tokens with specific semantic functions.
While a majority of functionally-defined planning tokens are high-entropy (left), high-entropy tokens are not a good proxy for planning tokens, as most of them serve other functions (right).
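One way to quantify this asymmetry is to compare two conditional frequencies, as in the sketch below: the fraction of planning tokens that are high-entropy versus the fraction of high-entropy tokens that are planning tokens. The 80th-percentile entropy cutoff is an illustrative choice, not the threshold used in the paper.

```python
import numpy as np

def entropy_planning_overlap(token_entropy, planning_mask, quantile=0.8):
    """Asymmetry between high-entropy tokens and planning tokens.

    token_entropy: [T] per-token entropies collected over many responses
    planning_mask: [T] bool, True at functionally identified planning tokens
    quantile:      entropy cutoff defining "high-entropy" (illustrative)
    """
    token_entropy = np.asarray(token_entropy)
    planning_mask = np.asarray(planning_mask, dtype=bool)

    threshold = np.quantile(token_entropy, quantile)
    high_entropy = token_entropy >= threshold
    both = high_entropy & planning_mask

    # P(high-entropy | planning): how many planning tokens are uncertain
    p_high_given_plan = both.sum() / planning_mask.sum()
    # P(planning | high-entropy): how many uncertain tokens are strategic
    p_plan_given_high = both.sum() / high_entropy.sum()
    return p_high_given_plan, p_plan_given_high
```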
Measuring strategic exploration accurately is crucial for diagnosing policy learning. However, we find that common metrics like token-level entropy can be misleading.
Semantic Entropy avoids these pitfalls. It directly measures the diversity of meaningful strategic plans. As shown below, semantic entropy remains a powerful differentiator, revealing HICRA’s continued strategic exploration even when token entropy has collapsed and Pass@8 has saturated. This makes it a far more reliable compass for tracking true reasoning development.
Token entropy (far right) collapses and Pass@8 (second from right) saturates, becoming useless. In contrast, Semantic Entropy (far left) clearly shows HICRA's sustained exploration advantage, which correlates with better final accuracy.
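For reference, Pass@8 above is the standard pass@k metric; a minimal implementation of the usual unbiased estimator (Chen et al., 2021) looks like this.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator.

    n: number of samples drawn per problem
    c: number of correct samples among them
    k: k in pass@k (e.g. 8 for Pass@8)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```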
@article{hicra,
title={Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning},
author={Wang, Haozhe and Xu, Qixin and Liu, Che and Wu, Junhong and Lin, Fangzhen and Chen, Wenhu},
journal={arXiv preprint arXiv:2509.03646},
year={2025}
}