
MIT's AI Coach Boosts Language Models' Problem-Solving Abilities

MIT researchers have developed CodeSteer, an intelligent assistant that guides large language models to switch between text and code generation until correctly answering complex queries. The system increased LLM accuracy on symbolic tasks like math problems and spatial reasoning by more than 30%, enabling less sophisticated models to outperform more advanced ones. This breakthrough could significantly improve AI problem-solving capabilities for complex tasks in robotics, supply chain management, and other fields requiring precise computational reasoning.

Large language models (LLMs) excel at understanding context and providing logical answers through textual reasoning. However, they often struggle with computational tasks that would be better solved using code, such as comparing decimal numbers or solving optimization problems.
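
For illustration, consider the decimal comparison the researchers mention. The minimal Python sketch below (not from the article) shows why code sidesteps the trap: textual reasoning tends to compare "9.11" and "9.9" digit by digit, while code compares numeric values directly.

```python
# Minimal illustration (not from the article): a comparison that LLMs
# often get wrong in text is trivial once expressed as code.
a, b = 9.11, 9.9
print(max(a, b))  # 9.9 -- numeric comparison, not digit-string comparison
print(a < b)      # True
```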

To address this limitation, researchers from MIT have developed CodeSteer, a smart assistant that acts as a coach for larger language models, guiding them to switch between text and code generation until they correctly answer a query.

"We were inspired by humans. In sports, a trainer may not be better than the star athlete on the team, but the trainer can still give helpful suggestions to guide the athlete. This steering method works for LLMs, too," explains Yongchao Chen, a graduate student at Harvard and MIT who worked on the project.

CodeSteer, itself a smaller LLM fine-tuned from Llama-3-8B, works by reviewing a query and determining whether text or code is better suited to solving it. It then generates prompts for the larger LLM, guiding it to use the appropriate method. If the answer isn't correct, CodeSteer keeps prompting the LLM to try different approaches until it reaches the correct solution.
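
The article does not give CodeSteer's prompts or checking logic, but the loop it describes can be sketched in outline. In the Python sketch below, every name (codesteer_suggest, solver_llm, answer_checks_out) is a hypothetical stand-in for a component the researchers describe, not their actual code.

```python
# Hypothetical sketch of the steering loop described above. The stubs
# stand in for the small steering model, the large solver model, and
# the answer check; none of them is the researchers' actual API.

MAX_ROUNDS = 5

def codesteer_suggest(query, history):
    """Stand-in for the small fine-tuned steering model: picks 'text'
    or 'code' and a guiding prompt from the query and past attempts."""
    mode = "code" if any(ch.isdigit() for ch in query) else "text"
    return mode, f"Answer the query using {mode} reasoning."

def solver_llm(query, mode, guidance):
    """Stand-in for the larger LLM being steered."""
    return f"[{mode} answer to: {query}]"

def answer_checks_out(query, answer):
    """Stand-in for the verification step; CodeSteer keeps prompting
    until the answer passes its checks."""
    return True

def solve_with_steering(query):
    history = []
    answer = None
    for _ in range(MAX_ROUNDS):
        mode, guidance = codesteer_suggest(query, history)
        answer = solver_llm(query, mode=mode, guidance=guidance)
        if answer_checks_out(query, answer):
            return answer
        history.append((mode, guidance, answer))
    return answer  # fall back to the last attempt

print(solve_with_steering("Which is larger, 9.11 or 9.9?"))
```

The design point the article emphasizes is that the steering model can be far smaller than the model it steers; only the mode decision and prompting need to be smart, not the solving itself.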

The researchers found that augmenting GPT-4o with CodeSteer boosted its accuracy on symbolic tasks by more than 30%, raising its average performance score from 53.3 to 86.4 across 37 tasks. This improvement enabled it to outperform even more advanced models like OpenAI's o1 (82.7) and DeepSeek R1 (76.8). Remarkably, CodeSteer also demonstrated strong generalizability, providing an average 41.8% performance boost when applied to other models like Claude, Mistral, and GPT-3.5.

To develop and test CodeSteer, the researchers created SymBench, a comprehensive benchmark comprising 37 symbolic tasks with adjustable complexity. These tasks span mathematics, spatial reasoning, logic, order reasoning, and optimization problems.
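
The article doesn't detail SymBench's individual tasks, but "adjustable complexity" can be illustrated with a hypothetical example: a single knob that scales how hard an order-reasoning query becomes for text-only reasoning, while staying trivial for code. The task generator below is an invented illustration, not an actual SymBench task.

```python
# Hypothetical illustration only -- not an actual SymBench task. One knob
# (n_items) scales how hard an order-reasoning query is to answer by
# text alone, while code solves it at any size.
import random

def make_ordering_task(n_items: int, seed: int = 0):
    """Generate a sorting query whose difficulty grows with n_items."""
    rng = random.Random(seed)
    values = [round(rng.uniform(0, 100), 2) for _ in range(n_items)]
    query = f"Sort these numbers in ascending order: {values}"
    answer = sorted(values)  # ground truth, trivial for code
    return query, answer

q, a = make_ordering_task(n_items=8)
print(q)
print("Expected:", a)
```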

This breakthrough could significantly improve AI problem-solving capabilities for complex tasks that are difficult to solve with textual reasoning alone, such as generating paths for robots in uncertain environments or scheduling shipments in international supply chains.

"By augmenting an LLM with the ability to smartly use coding, we can take a model that is already very strong and improve its performance even more," Chen notes. The researchers are now working to streamline CodeSteer to speed up its iterative prompting process and exploring how to fine-tune a unified model that can switch between textual reasoning and code generation without relying on a separate assistant.

Source: Techxplore
