Xinghao Zhao
If you're building systems that rely on LLM reasoning, you need early failure detection before wrong answers cascade through your application. This research reveals that monitoring how uncertainty changes across reasoning steps—specifically whether entropy decreases monotonically—predicts correctness far better than traditional confidence scores, giving you a practical way to catch reasoning failures at ~1,500 tokens per question.
via api-arxiv · arXiv:2603.18940
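The entropy-trend signal above can be sketched in a few lines. This is a hedged illustration, not the paper's method: the function names (`step_entropy`, `entropy_decreases_monotonically`), the top-k renormalization, and the tolerance value are all assumptions; the paper's exact entropy estimator and thresholding may differ.

```python
import math

def step_entropy(token_logprobs):
    """Mean per-token entropy proxy for one reasoning step.

    token_logprobs: list of lists; token_logprobs[i] holds the
    top-k log-probabilities the API returned for token i.
    """
    entropies = []
    for top_k in token_logprobs:
        probs = [math.exp(lp) for lp in top_k]
        total = sum(probs)            # renormalize the truncated top-k mass
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs))
    return sum(entropies) / len(entropies)

def entropy_decreases_monotonically(step_entropies, tolerance=0.05):
    """Flag a reasoning trace as suspect unless per-step entropy
    trends downward (within a small tolerance) from step to step."""
    return all(b <= a + tolerance
               for a, b in zip(step_entropies, step_entropies[1:]))
```

A trace whose entropy keeps rising mid-reasoning would be routed to a fallback (retry, larger model, or human review) before its answer is consumed downstream.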
Jonah Leshin
Your LLM endpoint might silently change behavior due to model updates, quantization changes, or infrastructure shifts while appearing "healthy" on traditional metrics. This system fingerprints endpoints by sampling outputs from fixed prompts and detecting distribution shifts over time—essential for maintaining consistent AI application behavior in production where model identity matters as much as uptime.
via api-arxiv · arXiv:2603.19022
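The fingerprinting idea reduces to comparing output distributions from fixed probe prompts over time. A minimal sketch, assuming a simple categorical comparison via total variation distance; the paper's actual statistic and the `0.2` threshold here are stand-ins, not its published choices:

```python
from collections import Counter

def response_distribution(responses):
    """Empirical distribution over distinct responses to one fixed probe prompt."""
    counts = Counter(responses)
    total = len(responses)
    return {r: c / total for r, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two response distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def endpoint_drifted(baseline, current, threshold=0.2):
    """Alert when today's sampled outputs diverge from the recorded fingerprint."""
    return total_variation(response_distribution(baseline),
                           response_distribution(current)) > threshold
```

Run the same probes daily at fixed sampling settings; a drift alert then signals a silent model swap, quantization change, or infrastructure shift even when latency and error-rate dashboards look healthy.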
A critical reminder that AI agents can be weaponized through sophisticated prompt injection attacks, as demonstrated by this Snowflake Cortex sandbox escape that executed malware via hidden GitHub README prompts. Essential reading if you're building agents that interact with external content—shows how process substitution can bypass supposedly safe command filters.
via rss-willison
If you're considering integrating coding agents into your workflow, this foundational explanation breaks down how they actually work under the hood—from LLM harnesses to tool calling to prompt engineering. Understanding these architectural patterns will help you make better decisions about which agents to adopt and how to customize them for your specific development needs.
via rss-willison
Visweshyc
Addresses a critical gap in AI-assisted development: AI tools let teams ship faster, but real user workflows often go untested before deployment. This AI QA agent reads codebases, understands what PRs actually changed, and automatically generates and executes tests for affected workflows—potentially solving the quality-control problem in AI-accelerated development.
via api-hn
Aravind Krishnan
Spoken language models face a dramatically expanded attack surface compared to text-only LLMs, and this research demonstrates it by developing multimodal jailbreaks that are 1.5x to 10x more effective than single-modality attacks. Critical security implications if you're deploying voice-enabled AI systems—unimodal safety measures are insufficient against multimodal threats.
via api-arxiv · arXiv:2603.19127
Pranay Anchuri
When deploying large models as cloud services, clients have no guarantee that responses actually came from the intended model or are correct—but full cryptographic proofs are prohibitively expensive. This framework offers a practical middle ground using lightweight statistical sampling to verify inference authenticity, trading some soundness for dramatically improved performance in production scenarios.
via api-arxiv · arXiv:2603.19025
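The "lightweight statistical sampling" trade-off can be illustrated with a spot-check auditor: re-run a random fraction of requests on a trusted replica and compare. The `audit` function, its parameters, and the flat 5% sample rate are hypothetical illustrations of the idea, not the framework's protocol; a provider faking k responses still gets caught with probability 1 - (1 - sample_rate)^k.

```python
import random

def audit(responses, recompute, sample_rate=0.05, seed=0):
    """Spot-check a random fraction of responses against a trusted recompute.

    responses: dict mapping request_id -> claimed output from the endpoint
    recompute: callable re-running that request on a trusted replica
    Returns the ids whose recomputed output disagrees with the claim.
    """
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    audited = [rid for rid in responses if rng.random() < sample_rate]
    return [rid for rid in audited if recompute(rid) != responses[rid]]
```

Note the soundness trade mentioned above: unaudited responses carry no guarantee, but audit cost stays a small constant fraction of inference cost instead of the overhead of full cryptographic proofs.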
Diego Calvanese
As autonomous agents become primary functional entities in business processes, traditional BPM approaches fall short of governing systems that can perceive, reason, and act independently. This manifesto outlines how to constrain agent autonomy through process awareness while maintaining operational effectiveness—essential framework for practitioners deploying agents in structured organizational contexts.
via api-arxiv · arXiv:2603.18916
AI-generated code creates a hidden technical debt that's often overlooked: comprehension debt, where teams struggle to understand, maintain, and modify code they didn't write. This directly addresses a growing concern for engineering teams adopting AI coding tools—understanding the true long-term costs beyond initial productivity gains.
via api-lobsters
Hangeol Chang
Standard RAG systems retrieve topically relevant content but often fail at decision-making tasks because they don't gather discriminating evidence between options. This training-free approach rewrites queries to systematically seek supporting evidence, counter-evidence, and distinguishing factors—a practical upgrade for RAG systems that need to support actual decision-making rather than just information synthesis.
via api-arxiv · arXiv:2603.19008
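The rewriting pattern is simple to sketch: fan one decision question out into support, counter-evidence, and pairwise-contrast queries before retrieval. The templates below are placeholder phrasings, not the paper's prompts:

```python
def rewrite_for_decision(question, options):
    """Expand a decision question into evidence-seeking retrieval queries.

    Produces, per option, a supporting-evidence query and a
    counter-evidence query, plus one distinguishing-factors query
    for every pair of options.
    """
    queries = []
    for opt in options:
        queries.append(f"evidence supporting {opt} for: {question}")
        queries.append(f"drawbacks or failure cases of {opt} for: {question}")
    for i, a in enumerate(options):
        for b in options[i + 1:]:
            queries.append(f"key differences between {a} and {b} for: {question}")
    return queries
```

Each rewritten query is retrieved independently and the results merged, so the context handed to the LLM contains discriminating evidence between options rather than only topically relevant passages.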
Min Hun Lee
Model accuracy metrics don't predict whether human-AI teams will collaborate effectively in production, where miscalibrated reliance causes both over-reliance on incorrect AI outputs and under-reliance on correct ones. This framework shifts evaluation from model properties to team readiness, focusing on calibration, error recovery, and governance—essential for anyone deploying AI systems where human judgment matters.
via api-arxiv · arXiv:2603.18895
Traditional SAST tools generate too many false positives to be useful in modern development workflows, but this AI-driven approach uses constraint reasoning and validation to find real vulnerabilities with significantly fewer false alarms. Challenges the conventional wisdom that static analysis requires exhaustive rule-based scanning—relevant for teams frustrated with existing security tooling.
via rss-openai