DAPO: Open-Source Breakthrough Revolutionizes AI Reasoning

Researchers from ByteDance and Tsinghua University have released DAPO, a fully open-source reinforcement learning system that achieves state-of-the-art mathematical reasoning capabilities. The system outperforms previous models while using 50% fewer training steps and makes previously concealed technical details accessible to the broader AI community. This breakthrough addresses the transparency gap in advanced AI reasoning systems, enabling wider innovation and reproducibility.
In a significant advancement for open-source artificial intelligence, researchers from ByteDance and Tsinghua University have unveiled DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a groundbreaking reinforcement learning system that achieves exceptional reasoning capabilities while prioritizing transparency and accessibility.

DAPO represents a direct response to the AI community's struggle to reproduce state-of-the-art reinforcement learning results. Reinforcement learning has become central to advancing Large Language Models (LLMs), equipping them with the improved reasoning capabilities needed for complex tasks. However, major industry players such as OpenAI and DeepSeek have disclosed key training details only incompletely, and this opacity has limited broader scientific progress and collaborative research.

The system achieves 50 points on the challenging AIME 2024 mathematical competition using the Qwen2.5-32B base model. Unlike previous works that withhold training details, DAPO introduces four key techniques that make large-scale LLM reinforcement learning successful. Additionally, the researchers have open-sourced their training code, built on the verl framework, along with a carefully curated and processed dataset.

What makes DAPO particularly impressive is its efficiency: it outperforms the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B while using only 50% of the training steps. This efficiency stems from four core innovations.

The first, "Clip-Higher," addresses entropy collapse, a failure mode in which the model prematurely settles into a narrow set of outputs and stops exploring. By decoupling the lower and upper clipping ratios in policy updates and raising the upper bound, the technique leaves low-probability tokens room to grow, encouraging greater diversity in model outputs. "Dynamic Sampling" counters training inefficiency by filtering out prompts whose sampled responses all receive the same reward, ensuring every batch carries a useful gradient signal. The "Token-level Policy Gradient Loss" averages the loss over all tokens in a batch rather than per sample, so long reasoning sequences are weighted in proportion to their length instead of being diluted. Lastly, "Overlong Reward Shaping" applies a soft, graduated penalty to excessively long responses, gently guiding models toward concise and efficient reasoning.
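To make two of these ideas concrete, here is a minimal, illustrative Python sketch of a decoupled-clip surrogate loss with token-level averaging, plus a Dynamic Sampling filter. This is not the authors' verl implementation; the function names and epsilon values are placeholders chosen for illustration.

```python
import math

def dapo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.28):
    """Decoupled-clip surrogate loss, averaged at the token level.

    logp_new / logp_old: per-token log-probs for each sampled response
    (responses may differ in length). advantages: one scalar advantage
    per response. eps_high > eps_low is the "Clip-Higher" idea: a raised
    upper bound leaves low-probability tokens room to grow.
    """
    total, n_tokens = 0.0, 0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        for t_new, t_old in zip(lp_new, lp_old):
            ratio = math.exp(t_new - t_old)  # importance-sampling ratio
            clipped = max(1 - eps_low, min(ratio, 1 + eps_high))
            # pessimistic (min) PPO-style objective, negated to form a loss
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level averaging: divide by the total token count so long
    # reasoning chains are weighted by length, not flattened per sample.
    return total / n_tokens

def keep_prompt(group_rewards):
    """Dynamic Sampling filter: drop prompts whose sampled responses all
    earned the same reward (all right or all wrong); such groups have
    zero normalized advantage and contribute no gradient signal."""
    return len(set(group_rewards)) > 1
```

In this sketch, a prompt whose four sampled answers score `[1, 1, 1, 1]` would be filtered out, while `[0, 1, 1, 0]` would be kept for training.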

The DAPO release comes amid a surge in open-source reinforcement learning breakthroughs. Another notable advancement is MiroMind-M1, a fully open-source pipeline spanning datasets, models, training code, and evaluation scripts. Built on the Qwen-2.5 backbone with enhancements geared specifically toward mathematical reasoning, it sets a new standard for openness within the Qwen-2.5 model ecosystem.

The industry impact of these developments is substantial, with the reinforcement learning sector valued at more than $122 billion in 2025. Its applications span robotics, autonomous vehicles, supply chain optimization, healthcare, and gaming, with use cases expanding as the technology matures.

By making previously inaccessible methodologies fully transparent, DAPO and similar open-source initiatives are democratizing advanced AI capabilities, enabling researchers, startups, and established companies to build upon these innovations without the constraints of proprietary systems.
