Agents in Production, Safety Under Pressure

May 4, 2026 · 12 papers

This week covers both the promise and the risks of production AI systems: autonomous agents handling $20M in real trading volume, new frameworks for building reliable agent harnesses, and a benchmark for persistent multi-day AI coworkers. On the safety side, new research shows that fine-tuning can unexpectedly degrade safety properties, and a separate paper introduces efficient red-teaming methods for long-context security vulnerabilities.

Intermediate

Fine-Tuning for an Exam Quality Tutor

llms how-we-work

A hands-on exploration of fine-tuning a 27B-parameter model for personalized learning, showing the practical realities of adapting large models to specific use cases. This personal experiment offers insight into the effort, infrastructure, and unexpected challenges involved in moving beyond API calls to custom model training.

Takeaways

Fine-tuning large models for specialized tasks requires significant infrastructure planning and iteration cycles.
The gap between theoretical fine-tuning approaches and practical implementation reality is substantial.
Personal use cases can serve as effective testing grounds for understanding model customization challenges.

via suggestion

Advanced

The Abstraction Fallacy: Why AI Can Simulate But Not Instantiate Consciousness — Google DeepMind

foundational llms

Google DeepMind challenges the assumption that sophisticated AI behavior indicates genuine consciousness, arguing that simulation and instantiation are fundamentally different. This foundational perspective is crucial for engineers building AI systems, as it helps calibrate expectations about what current models can truly achieve versus what they appear to demonstrate.

Takeaways

AI models can simulate conscious-like behavior without possessing actual consciousness or understanding.
The distinction between simulation and instantiation has practical implications for system design and user expectations.
Understanding these limitations helps engineers build more robust and appropriately scoped AI applications.

via suggestion

Intermediate

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan

software-engineering evaluations how-we-work foundational

This research reframes LLM data engineering by mapping the machine learning lifecycle onto software development practices: training data is treated as source code, model training as compilation, and failures as bugs to debug. For teams struggling with opaque training processes and data quality issues, this framework offers a systematic approach to diagnosing and fixing model deficiencies at the data level.

Takeaways

Training data can be treated as source code with structured representations enabling systematic debugging of model failures.
The ML development lifecycle maps precisely onto software engineering practices when proper abstractions are established.
Concept-level gaps in training data become debuggable when models fail on domain-specific tasks.
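
A minimal sketch of the "failures as bugs in the data" idea, assuming a hypothetical setup in which every failed evaluation item and every training document is tagged with the concepts it covers. The names `concept_gaps`, `failed_tests`, and `corpus`, and the SQL-flavored concept tags, are illustrative assumptions and do not come from the paper.

```python
from collections import Counter


def concept_gaps(failed_tests, corpus, min_examples=2):
    """Return concepts exercised by failing tests that the corpus barely covers.

    failed_tests and corpus are lists of dicts with a "concepts" list
    (a hypothetical schema chosen for this sketch).
    """
    # Count how many training documents mention each concept.
    coverage = Counter(c for doc in corpus for c in doc["concepts"])
    # Collect every concept that appears in at least one failing test.
    needed = {c for test in failed_tests for c in test["concepts"]}
    # A "gap" is a needed concept with fewer than min_examples documents.
    return sorted(c for c in needed if coverage[c] < min_examples)


failed_tests = [
    {"id": "t1", "concepts": ["joins", "window_functions"]},
    {"id": "t2", "concepts": ["window_functions", "ctes"]},
]
corpus = [
    {"text": "doc a", "concepts": ["joins"]},
    {"text": "doc b", "concepts": ["joins", "ctes"]},
    {"text": "doc c", "concepts": ["ctes"]},
]

gaps = concept_gaps(failed_tests, corpus)
print(gaps)
```

In this toy corpus the failing tests point at a concept with no training coverage, which is the kind of data-level "bug" the framework aims to surface.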

via api-hf · arXiv:2604.24819

Intermediate

Safety Drift After Fine-Tuning: Evidence from High-Stakes Domains

Emaan Bilal Khan, Amy Winecoff, Miranda Bogen, Dylan Hadfield-Menell

security llms foundational

This study challenges the assumption that fine-tuning preserves safety properties, showing that even benign domain adaptation can unpredictably degrade model safety across different evaluation metrics. Essential reading for any team planning to deploy fine-tuned models in production, as it demonstrates why base model safety evaluations are insufficient for real-world deployments.

Takeaways

Fine-tuning can unpredictably alter safety behavior even when the training data appears benign and domain-appropriate.
Safety evaluations of base models do not reliably predict the safety of fine-tuned versions.
Production deployments of fine-tuned models require explicit safety re-evaluation with domain-specific benchmarks.
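
One way to make the re-evaluation concrete is to run the same safety prompts against the base and fine-tuned models and compare the outcomes. The sketch below assumes a hypothetical refusal-rate metric and hand-written results; `EvalResult`, `safety_drift`, and `regressed_prompts` are illustrative names, not the paper's methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    prompt_id: str
    refused: bool  # True if the model declined the unsafe request


def refusal_rate(results):
    """Fraction of unsafe prompts the model refused."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.refused for r in results) / len(results)


def safety_drift(base, tuned):
    """Tuned refusal rate minus base refusal rate (negative means fewer refusals)."""
    return refusal_rate(tuned) - refusal_rate(base)


def regressed_prompts(base, tuned):
    """Prompt ids the base model refused but the fine-tuned model did not."""
    base_refused = {r.prompt_id for r in base if r.refused}
    return sorted(
        r.prompt_id for r in tuned if not r.refused and r.prompt_id in base_refused
    )


# Hypothetical results for four unsafe prompts (made-up data for illustration).
base = [EvalResult(p, True) for p in ("p1", "p2", "p3", "p4")]
tuned = [
    EvalResult("p1", True),
    EvalResult("p2", False),
    EvalResult("p3", True),
    EvalResult("p4", False),
]

print(safety_drift(base, tuned))
print(regressed_prompts(base, tuned))
```

Tracking the per-prompt regressions alongside the aggregate rate helps show where the drift occurs, not only that it occurred.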

via api-hf · arXiv:2604.24902

Accessible

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

agents software-engineering evaluations how-we-work

A remarkable real-world case study of autonomous LLM agents managing actual financial capital over 21 days, generating 7.5M invocations and $20M in trading volume with 99.9% settlement success. This paper provides invaluable insights into building reliable production agent systems, showing that reliability emerges from the operating layer architecture rather than the base model alone.

Takeaways

Reliability in production AI agents comes from systematic operating layer controls, not just model capabilities.
Real capital deployment reveals failure modes and reliability patterns invisible in simulation environments.
Large-scale agent deployments require careful attention to validation, state management, and settlement infrastructure.

via api-hf · arXiv:2604.26091

Intermediate

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

agents evaluations

Addresses a critical gap in agent evaluation by introducing a benchmark for persistent, multi-day coworker agents that operate in evolving environments with emails, calendars, and documents. This benchmark is essential for teams building production agent systems that must maintain context and effectiveness across extended periods rather than single-session interactions.

Takeaways

Multi-day, stateful agent evaluation requires fundamentally different benchmarks than single-episode tasks.
Production coworker agents must handle independently evolving environments with multimodal information sources.
Deterministic verification methods can replace LLM-as-judge approaches for more reliable agent assessment.
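
To illustrate what deterministic verification can look like in contrast to an LLM judge, here is a minimal sketch that checks the final environment state directly. The calendar schema and the `verify_meeting_scheduled` function are assumptions made for this example; the digest does not describe ClawMark's actual checkers.

```python
def verify_meeting_scheduled(calendar, title, day, attendees):
    """Pass only if some event matches the title and day and includes all attendees.

    calendar is a list of dicts with "title", "day", and "attendees" keys
    (a hypothetical schema for this sketch).
    """
    required = set(attendees)
    return any(
        event["title"] == title
        and event["day"] == day
        and required <= set(event["attendees"])
        for event in calendar
    )


# Hypothetical final state of the agent's calendar after a multi-day task.
final_calendar = [
    {"title": "Standup", "day": 2, "attendees": ["ana", "raj"]},
    {"title": "Budget review", "day": 3, "attendees": ["ana", "raj", "li"]},
]

print(verify_meeting_scheduled(final_calendar, "Budget review", 3, ["ana", "raj"]))
```

Because the check is a pure function of the environment state, the same run always produces the same verdict, which is the property that an LLM-as-judge setup cannot guarantee.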

via api-hf · arXiv:2604.23781

Intermediate

The Last Harness You'll Ever Build

Haebin Seong, Li Yin, Haoran Zhang

agents software-engineering how-we-work

Presents an evolutionary framework that automates the painful process of building agent harnesses for new domains, using adversarial evaluation and iterative refinement to optimize prompts, tools, and orchestration logic. This directly tackles one of the biggest bottlenecks in production AI systems—the manual engineering required to make foundation models effective for specific enterprise workflows.

Takeaways

Agent harness engineering can be automated through evolutionary optimization with adversarial evaluation feedback.
The meta-evolution loop concept enables systems to improve their own optimization processes over time.
Automated harness creation could dramatically reduce the engineering overhead of deploying agents in new domains.
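
The general shape of evolutionary optimization over a harness configuration can be sketched as a mutate-score-select loop. This is a toy illustration under stated assumptions, not the paper's framework: the harness is reduced to two hypothetical integer settings, and a fixed scoring function stands in for the adversarial evaluation the paper describes.

```python
import random


def evolve(initial, mutate, score, generations=20, children=4, seed=0):
    """Hill-climbing evolutionary loop that keeps the best candidate seen so far."""
    rng = random.Random(seed)  # seeded for reproducible runs
    best, best_score = initial, score(initial)
    history = [best_score]
    for _ in range(generations):
        for _ in range(children):
            candidate = mutate(best, rng)
            candidate_score = score(candidate)
            if candidate_score > best_score:  # elitism: never accept a worse harness
                best, best_score = candidate, candidate_score
        history.append(best_score)
    return best, history


def mutate(harness, rng):
    """Nudge one randomly chosen setting up or down by one (hypothetical settings)."""
    child = dict(harness)
    key = rng.choice(sorted(child))
    child[key] = max(0, child[key] + rng.choice([-1, 1]))
    return child


def score(harness):
    """Stand-in evaluator: higher is better, peaking at an arbitrary target."""
    return -abs(harness["max_retries"] - 3) - abs(harness["tool_timeout_s"] - 30)


initial = {"max_retries": 0, "tool_timeout_s": 25}
best, history = evolve(initial, mutate, score)
print(best, history[-1])
```

In a real system the scoring step would run the candidate harness against evaluation tasks, which is by far the most expensive part of the loop.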

via api-hf · arXiv:2604.21003

Intermediate

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Chenyu You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

foundational agents software-engineering opinion

Proposes a radical reimagining of research artifacts as machine-executable packages that preserve the full exploration process, including failures and implementation details that traditional papers discard. For teams building AI agents that need to understand and extend existing work, this framework offers a path toward truly reproducible and agent-consumable research.

Takeaways

Traditional research papers impose storytelling and engineering taxes that make them unsuitable for AI agents to consume and extend.
Agent-native artifacts should preserve the full exploration graph including failed experiments and rejected hypotheses.
Machine-executable research packages can bridge the gap between human-readable findings and agent-actionable specifications.

via api-hf · arXiv:2604.24658

Intermediate

Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

Qi Li, Bo Yin, Weiqi Huang, Ruhao Liu, Bojun Zou, Runpeng Yu, Jingwen Ye, Weihao Yu, Xinchao Wang

security agents

Provides a comprehensive framework for understanding safety challenges in Vision-Language-Action models, organizing threats and defenses across training and inference time dimensions. Critical reading for teams building embodied AI systems, as it unifies fragmented safety research and highlights unique risks like irreversible physical consequences and multimodal attack surfaces.

Takeaways

VLA systems face unique safety challenges including irreversible physical consequences and multimodal attack vectors.
Attack and defense timing frameworks help organize mitigation strategies across the development lifecycle.
Embodied AI safety requires different approaches than text-only LLM safety due to real-world interaction constraints.

via api-hf · arXiv:2604.23775

Intermediate

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen, Jinyuan Jia

security evaluations llms

Addresses the computational bottleneck in red-teaming long-context LLMs for prompt injection and knowledge corruption attacks, offering memory-efficient optimization methods for security evaluation. Essential for teams needing to assess security risks in production systems without prohibitive computational costs, especially for long-context applications like RAG and autonomous agents.

Takeaways

Optimization-based red-teaming provides more rigorous security assessment than heuristic methods but faces computational constraints.
Memory-efficient red-teaming methods enable systematic security evaluation of long-context models for academic and industry teams.
Prompt injection and knowledge corruption remain significant threats requiring continuous evaluation in production systems.

via api-hf · arXiv:2604.28157

Intermediate

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš, Iryna Gurevych

llms evaluations software-engineering

Challenges the narrow focus on functional correctness in code generation by developing multilingual reward models that score across multiple criteria like readability, efficiency, and security. This work is crucial for teams building production code generation systems, as it provides both evaluation benchmarks and training data for more holistic code quality assessment.

Takeaways

Current code reward models are overly focused on functional correctness while neglecting other critical quality dimensions.
Multilingual, multi-criteria evaluation reveals significant gaps in existing code generation assessment approaches.
The Themis dataset and benchmark provide practical tools for training and evaluating more comprehensive code reward models.
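
As a small illustration of flexible multi-criteria scoring, the sketch below combines per-criterion scores with caller-chosen weights. The criteria names echo the summary above, but the `combined_score` function, the weights, and the score values are assumptions for this example; how the Themis reward models produce their scores is not shown here.

```python
def combined_score(scores, weights):
    """Weighted average of per-criterion scores, each assumed to lie in [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for criteria: {sorted(missing)}")
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


# Hypothetical per-criterion scores for one code sample.
scores = {"correctness": 1.0, "readability": 0.5, "efficiency": 0.5, "security": 0.0}

# A team that cares most about correctness might weight it double.
weights = {"correctness": 2.0, "readability": 1.0, "efficiency": 1.0, "security": 1.0}

print(combined_score(scores, weights))
```

Changing the weights lets the same per-criterion scores rank candidates differently, for example by emphasizing security for code that handles untrusted input.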

via api-hf · arXiv:2605.00754

Intermediate

Where the goblins came from

llms evaluations how-we-work

Investigates the emergence and propagation of quirky, personality-driven outputs ('goblins') in AI models, tracing their timeline, root causes, and potential fixes. This analysis of unexpected model behavior is highly relevant for engineers debugging production systems and understanding how subtle training or deployment changes can lead to widespread behavioral shifts.

Takeaways

Personality-driven quirks in model outputs can emerge and spread through training processes in unexpected ways.
Understanding the root causes of 'goblin' behaviors helps engineers identify and prevent similar issues in production.
Model behavior debugging requires systematic analysis of training timelines and data sources.

via rss-openai
