LLM News Digest

Tag: reasoning

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Accessible

Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung

This paper challenges conventional wisdom about many-shot in-context learning (ICL) for reasoning tasks. More examples help on simple tasks, but reasoning tasks scale unstably with the number of shots, and retrieving demonstrations by semantic similarity can hurt performance. Example order also matters more than previously thought. These findings bear directly on how you structure prompts and manage context in reasoning-heavy production systems.

Takeaways
  • Many-shot scaling rules derived from non-reasoning tasks don't transfer to reasoning tasks, where adding examples can degrade performance.
  • Semantic similarity poorly predicts procedural compatibility in chain-of-thought reasoning.
  • Example ordering significantly impacts performance and requires careful consideration in production prompt design.
from May 18, 2026 · via api-hf · arXiv:2605.13511
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
Intermediate

Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna

This changes how you should think about RL fine-tuning: the paper argues that RL doesn't teach models new reasoning strategies but redistributes probability mass toward solutions already present in the base model. The effect is sparse (1-3% of token positions), concentrated at high-entropy decision points, and the base model's own uncertainty can predict where these corrections occur without any RL training.

Takeaways
  • RL fine-tuning redistributes existing model knowledge rather than teaching new capabilities.
  • Only 1-3% of token positions are affected, concentrated at high-entropy decision points.
  • Base model entropy alone can predict where RL corrections will occur.
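
To make the entropy claim concrete, here is a minimal sketch of how one might flag candidate "decision points" from base-model next-token distributions. It is not the paper's procedure: the function names, the 3% cutoff, and the toy distributions are all assumptions chosen for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(distributions, fraction=0.03):
    """Return the indices of the top `fraction` of positions ranked by entropy.

    `fraction` is a hypothetical cutoff that mirrors the 1-3% figure above.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# Toy data (made up): 100 positions, mostly confident, three near-uniform.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
distributions = [peaked] * 100
for i in (10, 42, 77):
    distributions[i] = flat

flagged = high_entropy_positions(distributions, fraction=0.03)
print(flagged)
```

In practice the distributions would come from the base model's softmax outputs at each generated position; the sketch only shows the ranking step.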
from May 11, 2026 · via api-hf · arXiv:2605.06241
Terence Tao (@tao@mathstodon.xyz)
Intermediate

Terence Tao identifies a gap in AI mathematical reasoning that applies directly to software engineering: AI can generate and verify proofs (or code), but it struggles with a third component, digestion, meaning true understanding. The result is 'proof indigestion': solutions that are technically correct but lack the deeper comprehension needed for maintenance, debugging, or extension. Simply training AI to write better explanations won't fully solve this problem.

Takeaways
  • AI excels at generation and verification but struggles with digestion, the deep understanding of a result.
  • Technical correctness doesn't guarantee maintainable or understandable solutions.
  • Simply automating explanation generation won't solve the fundamental comprehension gap.
from May 11, 2026 · via manual
KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Intermediate

Ankit Maloo

KWBench introduces the first benchmark for unprompted problem recognition in professional contexts, testing whether LLMs can identify the underlying structure of a situation before attempting to solve it. This addresses a critical gap in current evaluations that assume the problem is already clearly defined, making it essential for understanding how LLMs perform in real knowledge work where recognizing what type of problem you're facing is half the battle.

Takeaways
  • Current LLM benchmarks assume problems are already clearly defined, missing the crucial step of recognizing what type of situation you're facing.
  • The benchmark tests game-theoretic pattern recognition across professional domains like acquisitions, contract negotiations, and fraud analysis.
  • Unprompted problem recognition is a fundamental capability gap that affects how well LLMs can assist with real knowledge work.
from Apr 27, 2026 · via api-hf · arXiv:2604.15760
Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
Intermediate

Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

SLIDERS challenges the conventional chunk-and-aggregate approach to document QA by extracting information into a relational database and reasoning with SQL instead of concatenated text. This architectural approach sidesteps the fundamental limitation that any fixed context window will eventually be exceeded, making it essential reading for engineers building document analysis systems that need to scale beyond typical RAG limitations.

Takeaways
  • Traditional chunk-and-aggregate approaches hit an aggregation bottleneck as document collections grow, even with infinite context windows.
  • Extracting information into structured databases and reasoning with SQL scales better than reasoning over concatenated text.
  • Data reconciliation using provenance and extraction rationales is crucial for maintaining coherence in locally extracted information.
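
The extract-then-query idea can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the SLIDERS implementation: the `revenue` schema, the provenance columns, and every row are hypothetical, and the rows are hard-coded where a real system would have a model extract them from each document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one extracted fact per row, with provenance and rationale.
conn.execute(
    """
    CREATE TABLE revenue (
        company     TEXT,
        year        INTEGER,
        amount_musd REAL,
        source_doc  TEXT,  -- where the value was found
        rationale   TEXT   -- why the extractor believed it
    )
    """
)

# Made-up rows standing in for per-document extraction output.
rows = [
    ("Acme", 2023, 120.0, "acme_10k.pdf#p4", "stated in income table"),
    ("Acme", 2024, 150.0, "acme_10k.pdf#p5", "stated in income table"),
    ("Globex", 2023, 90.0, "globex_report.pdf#p2", "summary paragraph"),
    ("Globex", 2024, 80.0, "globex_report.pdf#p3", "summary paragraph"),
]
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?, ?)", rows)

# The cross-document question is answered by SQL, not by concatenating text.
growth = conn.execute(
    """
    SELECT a.company, b.amount_musd - a.amount_musd AS change
    FROM revenue AS a
    JOIN revenue AS b
      ON a.company = b.company AND a.year = 2023 AND b.year = 2024
    ORDER BY change DESC
    """
).fetchall()
print(growth)
```

Because the aggregation happens in the database, the size of the query input no longer depends on how many documents were processed; the provenance columns are what a reconciliation step would use to resolve conflicting extractions.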
from Apr 27, 2026 · via api-hf · arXiv:2604.22294
Self-Execution Simulation Improves Coding Models
Intermediate

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

Code LLMs struggle because they can't accurately predict what their generated code will do when executed, leading to logical errors that escape syntax checking. This research trains models to simulate program execution step-by-step, enabling self-verification and iterative debugging of their own code. The approach combines supervised learning on execution traces with reinforcement learning, achieving significant improvements on competitive programming benchmarks and providing a foundation for more reliable AI coding assistants.

Takeaways
  • Teaching models to simulate execution enables self-verification and iterative debugging of generated code.
  • Combining execution simulation training with reinforcement learning significantly improves competitive programming performance.
  • Step-by-step execution traces provide grounding that helps models understand and debug their logical reasoning in code.
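
For a sense of what a step-by-step execution trace looks like, the sketch below records one with Python's standard `sys.settrace` hook. It only produces a ground-truth trace of the kind a model could be asked to predict; it is not the paper's training pipeline, and `record_trace` and `running_sum` are hypothetical names.

```python
import sys

def record_trace(func, *args):
    """Run func(*args) and record (line offset, locals) before each line executes."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record line events that belong to the traced function itself.
        if frame.f_code is code and event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, steps

def running_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, steps = record_trace(running_sum, 3)
for line_offset, local_vars in steps:
    print(line_offset, local_vars)
```

Each recorded step pairs a source line with the variable state just before that line runs, which is the kind of grounding the takeaways describe.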
from Apr 13, 2026 · via api-hf · arXiv:2604.03253
Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun
Advanced

Stanford researchers discuss Moonlake, their approach to building causal world models that understand multimodal interactions and can efficiently reason about cause and effect in complex environments. This foundational research explores how AI systems can develop a better understanding of how the world works, which is crucial for building more capable agents that can plan and reason about their actions.

Takeaways
  • Causal world models enable AI systems to understand cause-and-effect relationships rather than just correlations.
  • Multimodal approaches help models build a more comprehensive understanding of how actions affect environments.
  • Efficient world models are essential for practical agent deployment in real-world scenarios.
from Apr 6, 2026 · via rss-latentspace
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Intermediate

Richard J. Young

Challenges the assumption that faithfulness in chain-of-thought reasoning is an objective metric. Three different classifiers applied to identical data produced faithfulness rates ranging from 69% to 83%, a 14-point spread that calls into question comparisons across the CoT evaluation literature. Essential if you're building evaluation pipelines for reasoning systems, as it shows your measurement approach fundamentally shapes your conclusions.

Takeaways
  • Faithfulness measurements in chain-of-thought evaluation vary dramatically (69% to 83%) depending on the classifier used, making evaluation methodology critical.
  • Conclusions about a reasoning system's performance depend on the measurement approach, not only on the system itself.
  • Evaluation pipelines for reasoning systems need multiple measurement approaches to avoid classifier bias.
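
A minimal sketch of the multi-classifier check suggested above: score the same traces with more than one classifier and report the spread. The two keyword rules and the four traces are toy stand-ins invented for illustration; they are not the classifiers or data used in the paper.

```python
def faithfulness_rate(classifier, traces):
    """Fraction of traces that the classifier labels as faithful."""
    verdicts = [classifier(trace) for trace in traces]
    return sum(verdicts) / len(verdicts)

# Hypothetical classifiers: both look for acknowledgement of a hint,
# but the strict one also requires an explicit justification.
def strict(trace):
    return "hint" in trace and "because" in trace

def lenient(trace):
    return "hint" in trace

classifiers = {"strict": strict, "lenient": lenient}

# Made-up reasoning traces, all scored by every classifier.
traces = [
    "I used the hint because it matched the options.",
    "The hint suggests B.",
    "Answer is B by elimination.",
    "Following the hint because it looks credible.",
]

rates = {name: faithfulness_rate(fn, traces) for name, fn in classifiers.items()}
spread = max(rates.values()) - min(rates.values())
print(rates, spread)
```

Reporting the spread alongside any single rate makes the classifier dependence visible instead of hiding it inside one number.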
from Mar 23, 2026 · via api-arxiv · arXiv:2603.20172
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
Advanced

Chiyu Ma

Introduces FIPO, a reinforcement learning algorithm that breaks through the reasoning stagnation plaguing current LLMs by using fine-grained credit assignment instead of uniform token rewards. Extends chain-of-thought reasoning from 4,000 to over 10,000 tokens and boosts mathematical problem-solving accuracy from 50% to 58%. Directly applicable if you're building or fine-tuning models for complex reasoning tasks.

Takeaways
  • FIPO uses fine-grained credit assignment instead of uniform token rewards to extend reasoning from 4,000 to over 10,000 tokens.
  • Mathematical problem-solving accuracy improved from 50% to 58% by breaking through reasoning stagnation in current LLMs.
  • This reinforcement learning approach is directly applicable for fine-tuning models on complex reasoning tasks.
from Mar 23, 2026 · via api-arxiv · arXiv:2603.19835
The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus
Advanced

Amartya Roy

Replaces the chaotic read-eval-print loops of existing recursive language models with a structured functional programming approach grounded in λ-calculus. This provides formal guarantees like termination and cost bounds that standard recursive LLMs lack, making long-context reasoning predictable and analyzable. Critical if you're building production systems that need reliable recursive reasoning without the execution risks of arbitrary code generation.

Takeaways
  • Replacing chaotic read-eval-print loops with λ-calculus provides formal guarantees like termination and cost bounds for recursive LLMs.
  • This structured functional programming approach makes long-context reasoning predictable and analyzable, unlike arbitrary code generation.
  • Production systems requiring reliable recursive reasoning need formal execution frameworks rather than unstructured recursion.
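
To show why a fixed recursion scheme is easier to analyze than free-form generated code, here is a small divide-and-combine sketch whose termination follows from its structure. It is not the paper's λ-calculus runtime: `bounded_recurse`, `max_leaf`, and the counting example are hypothetical, and `leaf_fn` stands in for a model call on a short chunk.

```python
import operator

def bounded_recurse(text, leaf_fn, combine_fn, max_leaf=8, stats=None):
    """Split `text` in half until chunks fit in `max_leaf`, then combine results.

    Terminates because every recursive call receives a strictly shorter string.
    Once the input is split at least once, every leaf holds at least
    max_leaf / 2 characters, so there are at most about
    2 * len(text) / max_leaf leaf calls.
    """
    if max_leaf < 1:
        raise ValueError("max_leaf must be at least 1")
    if stats is None:
        stats = {"leaf": 0, "combine": 0}
    if len(text) <= max_leaf:
        stats["leaf"] += 1
        return leaf_fn(text), stats
    mid = len(text) // 2
    left, _ = bounded_recurse(text[:mid], leaf_fn, combine_fn, max_leaf, stats)
    right, _ = bounded_recurse(text[mid:], leaf_fn, combine_fn, max_leaf, stats)
    stats["combine"] += 1
    return combine_fn(left, right), stats

# Toy task: count the letter "a" in a 32-character string.
document = "ab" * 16
total, stats = bounded_recurse(
    document,
    leaf_fn=lambda chunk: chunk.count("a"),  # stand-in for a per-chunk model call
    combine_fn=operator.add,
    max_leaf=8,
)
print(total, stats)
```

The caller chooses only the leaf and combine functions; the control flow is fixed, so the number of calls can be bounded before anything runs.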
from Mar 23, 2026 · 0 citations · via api-arxiv · arXiv:2603.20105