Tag: evaluations

Intermediate

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

This benchmark exposes the gap between synthetic agent evaluations and real-world performance. Where most benchmarks rely on mock APIs and toy tasks, WildClawBench runs agents in actual CLI environments with real tools on tasks that take eight minutes or more. The results are sobering: even frontier models like Claude Opus achieve only 35% success rates. If you're building production agents, this benchmark shows what you're actually up against.

Takeaways

Synthetic benchmarks dramatically overestimate how agents perform in production environments.
Long-horizon tasks in native runtimes reveal fundamental limitations even in frontier models.
Production agent deployment requires significantly different evaluation criteria than academic benchmarks suggest.

from May 18, 2026 · via api-hf · arXiv:2605.10912

Intermediate

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li

security agents evaluations

Hidden malicious intent across multiple dialogue turns represents a sophisticated attack vector that current guardrails miss. This research provides both detection methods and the Multi-Turn Intent Dataset for training systems to identify when seemingly innocent conversations accumulate into harmful instructions. Critical for anyone deploying conversational AI systems that need to detect distributed attacks rather than just obvious single-turn violations.

Takeaways

Multi-turn attacks can bypass safety measures by distributing malicious intent across seemingly benign interactions.
Turn-level intervention requires precise detection of harm-enabling closure points without premature refusal.
Production conversational systems need specialized guardrails for accumulated harmful intent detection.

from May 18, 2026 · via api-hf · arXiv:2605.05630

Accessible

Hallucinations Undermine Trust; Metacognition is a Way Forward

Gal Yona, Mor Geva, Yossi Matias

llms security evaluations foundational

Reframes the hallucination problem as confident errors rather than knowledge gaps, arguing that perfect factuality is impossible but appropriate uncertainty expression is achievable. This paper provides a practical framework for building more reliable LLM systems by focusing on metacognition—teaching models to know what they don't know—rather than trying to eliminate all errors, which preserves utility while reducing harmful overconfidence.

Takeaways

Hallucinations are fundamentally about inappropriate confidence, not just factual errors.
Perfect factuality may be impossible, but better uncertainty calibration is achievable.
Metacognitive approaches can maintain utility while reducing overconfident errors.

from May 11, 2026 · via api-hf · arXiv:2605.01428

Accessible

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

software-engineering evaluations agents how-we-work

The first comprehensive evaluation framework for AI coding platforms that treats them as virtual software agencies rather than just code generators. The 68-metric evaluation across product management, engineering, and operations reveals four critical shortcomings in current platforms: specification bottlenecks, architectural blind spots, iteration fragility, and business readiness gaps—essential insights for anyone building or evaluating AI development tools.

Takeaways

AI coding platforms need evaluation beyond code quality to include product management and operations capabilities.
Current platforms struggle with specification understanding, architectural decisions, and iterative development.
Business readiness requires capabilities spanning multiple roles, not just engineering output.

from May 11, 2026 · via api-hf · arXiv:2605.04637

Accessible

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang, Wenbo Guo, Dawn Song, Bo Li

agents security evaluations

The first comprehensive red-teaming platform specifically designed for AI agents, addressing the critical security gap as agents move from demos to production. With agents increasingly handling sensitive operations like API calls, data management, and financial transactions, DTap provides 14 real-world domains and 50+ simulation environments to systematically test how adversaries can manipulate agents into harmful actions—essential infrastructure for anyone deploying agents in production.

Takeaways

Agent security testing requires specialized tools beyond traditional LLM red-teaming approaches.
Real-world agent vulnerabilities span API key leakage, data deletion, and unauthorized transactions.
Comprehensive security evaluation needs controllable, reproducible environments across multiple domains.

from May 11, 2026 · via api-hf · arXiv:2605.04808

Intermediate

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Zhengkang Guo

agents evaluations

This benchmark tackles one of the hardest problems in agent development: maintaining reasoning quality when tools have complex dependencies and long-range interactions. The escape-room design forces agents to track hidden state, propagate intermediate results, and handle novel workflows. These are the scenarios where production agents fail most visibly, and performance drops from 90% to 60% as dependency depth increases.

Takeaways

Agent performance degrades sharply as tool dependency chains become more complex.
Current agents struggle with maintaining state across long sequences of tool interactions.
Real-world agent reliability requires testing beyond simple, isolated tool-use scenarios.

from May 11, 2026 · via api-arxiv · arXiv:2605.07926

Intermediate

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš, Iryna Gurevych

llms evaluations software-engineering

Challenges the narrow focus on functional correctness in code generation by developing multilingual reward models that score across multiple criteria like readability, efficiency, and security. This work is crucial for teams building production code generation systems, as it provides both evaluation benchmarks and training data for more holistic code quality assessment.

Takeaways

Current code reward models are overly focused on functional correctness while neglecting other critical quality dimensions.
Multilingual, multi-criteria evaluation reveals significant gaps in existing code generation assessment approaches.
The Themis dataset and benchmark provide practical tools for training and evaluating more comprehensive code reward models.

from May 4, 2026 · via api-hf · arXiv:2605.00754

Intermediate

Where the goblins came from

llms evaluations how-we-work

Investigates the emergence and propagation of quirky, personality-driven outputs ('goblins') in AI models, tracing their timeline, root causes, and potential fixes. This analysis of unexpected model behavior is highly relevant for engineers debugging production systems and understanding how subtle training or deployment changes can lead to widespread behavioral shifts.

Takeaways

Personality-driven quirks in model outputs can emerge and spread through training processes in unexpected ways.
Understanding the root causes of 'goblin' behaviors helps engineers identify and prevent similar issues in production.
Model behavior debugging requires systematic analysis of training timelines and data sources.

from May 4, 2026 · via rss-openai

Intermediate

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan

software-engineering evaluations how-we-work foundational

This research reframes LLM data engineering by mapping the machine learning lifecycle directly onto software development practices: training data is treated as source code, model training as compilation, and failures as bugs to debug. For teams struggling with opaque training processes and data quality issues, this framework offers a systematic approach to diagnosing and fixing model deficiencies at the data level.

Takeaways

Training data can be treated as source code with structured representations enabling systematic debugging of model failures.
The ML development lifecycle maps precisely onto software engineering practices when proper abstractions are established.
Concept-level gaps in training data become debuggable when models fail on domain-specific tasks.

from May 4, 2026 · via api-hf · arXiv:2604.24819

Accessible

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

agents software-engineering evaluations how-we-work

A remarkable real-world case study of autonomous LLM agents managing actual financial capital over 21 days, generating 7.5M invocations and $20M in trading volume with 99.9% settlement success. This paper provides invaluable insights into building reliable production agent systems, showing that reliability emerges from the operating layer architecture rather than the base model alone.

Takeaways

Reliability in production AI agents comes from systematic operating layer controls, not just model capabilities.
Real capital deployment reveals failure modes and reliability patterns invisible in simulation environments.
Large-scale agent deployments require careful attention to validation, state management, and settlement infrastructure.

from May 4, 2026 · via api-hf · arXiv:2604.26091

Intermediate

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh

agents evaluations

Addresses a critical gap in agent evaluation by introducing benchmarks for persistent, multi-day coworker agents that operate in evolving environments with emails, calendars, and documents. This benchmark is essential for teams building production agent systems that need to maintain context and effectiveness across extended time periods rather than single-session interactions.

Takeaways

Multi-day, stateful agent evaluation requires fundamentally different benchmarks than single-episode tasks.
Production coworker agents must handle independently evolving environments with multimodal information sources.
Deterministic verification methods can replace LLM-as-judge approaches for more reliable agent assessment.

from May 4, 2026 · via api-hf · arXiv:2604.23781

Intermediate

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen, Jinyuan Jia

security evaluations llms

Addresses the computational bottleneck in red-teaming long-context LLMs for prompt injection and knowledge corruption attacks, offering memory-efficient optimization methods for security evaluation. Essential for teams needing to assess security risks in production systems without prohibitive computational costs, especially for long-context applications like RAG and autonomous agents.

Takeaways

Optimization-based red-teaming provides more rigorous security assessment than heuristic methods but faces computational constraints.
Memory-efficient red-teaming methods enable systematic security evaluation of long-context models for academic and industry teams.
Prompt injection and knowledge corruption remain significant threats requiring continuous evaluation in production systems.

from May 4, 2026 · via api-hf · arXiv:2604.28157

Intermediate

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo

evaluations reasoning llms

KWBench introduces the first benchmark for unprompted problem recognition in professional contexts, testing whether LLMs can identify the underlying structure of a situation before attempting to solve it. This addresses a critical gap in current evaluations that assume the problem is already clearly defined, making it essential for understanding how LLMs perform in real knowledge work where recognizing what type of problem you're facing is half the battle.

Takeaways

Current LLM benchmarks assume problems are already clearly defined, missing the crucial step of recognizing what type of situation you're facing.
The benchmark tests game-theoretic pattern recognition across professional domains like acquisitions, contract negotiations, and fraud analysis.
Unprompted problem recognition is a fundamental capability gap that affects how well LLMs can assist with real knowledge work.

from Apr 27, 2026 · via api-hf · arXiv:2604.15760

Intermediate

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Alberto Messina

llms evaluations foundational

This research formalizes the hidden non-determinism that every production engineer encounters when deploying LLMs — outputs can vary even at temperature=0 due to implementation details like batch size and floating-point operations. The concept of 'background temperature' provides a framework for measuring and understanding this randomness, which is crucial for reproducible LLM applications and proper evaluation protocols.
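
The floating-point part of this can be seen in miniature. The sketch below is not the paper's method and uses invented numbers; it only illustrates that floating-point addition is order-dependent, so accumulating the same contributions in a different order (as a different batch size or kernel might) can flip a greedy choice between two near-tied tokens. The names `sum_in_order` and `greedy_pick` are hypothetical.

```python
# Illustrative only: values are chosen to create a near-tie between two tokens.

def sum_in_order(parts):
    """Accumulate left to right, as a simple reduction would."""
    total = 0.0
    for p in parts:
        total += p
    return total

def greedy_pick(logit_a, logit_b):
    """Temperature-0 style decoding: take the strictly larger logit, else B."""
    return "A" if logit_a > logit_b else "B"

# The same three contributions to token A's logit, summed in two orders.
parts = [0.1, 0.2, 0.3]
logit_a_forward = sum_in_order(parts)         # (0.1 + 0.2) + 0.3
logit_a_reverse = sum_in_order(parts[::-1])   # (0.3 + 0.2) + 0.1

logit_b = 0.6  # competing token B

print(logit_a_forward == logit_a_reverse)     # False: rounding differs by order
print(greedy_pick(logit_a_forward, logit_b))  # A
print(greedy_pick(logit_a_reverse, logit_b))  # B
```

Real inference stacks involve far larger reductions, but the mechanism is the same: no sampling is involved, yet the selected token depends on how the arithmetic was scheduled.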

Takeaways

LLMs exhibit hidden non-determinism even at temperature=0 due to implementation-level factors like batch size and floating-point precision.
Background temperature provides a formal framework for measuring the effective randomness introduced by different inference environments.
Understanding background temperature is essential for reproducible LLM applications and fair evaluation across different providers.

from Apr 27, 2026 · 0 citations · via api-arxiv · arXiv:2604.22411

Intermediate

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li, Yue Wang

llms software-engineering evaluations

WebGen-R1 tackles the challenge of training smaller LLMs to generate full websites using reinforcement learning, addressing the token costs and latency issues of current agentic approaches that rely on expensive multi-turn execution with proprietary models. The key innovation is designing reliable rewards for inherently subjective tasks like aesthetic evaluation and cross-page functionality, making end-to-end training feasible for complex code generation.

Takeaways

End-to-end RL training offers a promising alternative to expensive multi-turn agentic frameworks for complex code generation tasks.
The main bottleneck in training LLMs for website generation is designing reliable rewards for subjective qualities like aesthetics and functionality.
Scaffold-driven structured generation provides a framework for training smaller models to handle multi-file, project-level coding tasks.

from Apr 27, 2026 · via api-hf · arXiv:2604.20398

Accessible

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo

agents software-engineering how-we-work evaluations

SWE-chat provides the first large-scale empirical evidence of how developers actually use AI coding agents in the wild, revealing that usage patterns are bimodal and agents are surprisingly inefficient. The dataset shows that only 44% of agent-produced code makes it into user commits, challenging the narrative of coding agent effectiveness and providing crucial insights for anyone building or deploying these tools in production.

Takeaways

Real-world coding patterns are bimodal: 41% of sessions involve agents writing virtually all code, while 23% have humans writing everything themselves.
Despite improving capabilities, only 44% of agent-produced code survives into user commits, revealing significant inefficiency in natural settings.
The first large-scale dataset of real coding agent usage provides empirical evidence that challenges assumptions about agent effectiveness in production.

from Apr 27, 2026 · via api-hf · arXiv:2604.20779

Intermediate

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley, Christian Kästner

agents security evaluations

This research addresses a critical gap in AI agent security by introducing symbolic guardrails that provide formal guarantees against harmful actions, unlike neural approaches that only improve reliability. The paper reveals that 85% of agent safety benchmarks lack concrete policies, making this framework essential for anyone deploying agents in high-stakes business environments where privacy breaches or financial losses are unacceptable.
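
As a rough illustration of what a symbolic guardrail can look like, the sketch below checks each proposed tool call against explicit rules in ordinary code before it runs. Everything here is hypothetical (the `ToolCall` type, the tool names `issue_refund` and `send_email`, the rules and limits); it is not the paper's framework, only a minimal example of a policy that is enforced deterministically rather than learned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass(frozen=True)
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    name: str
    args: Dict[str, object]

# A rule returns a violation message, or None if the call is acceptable.
Rule = Callable[[ToolCall], Optional[str]]

def max_refund(limit: float) -> Rule:
    def rule(call: ToolCall) -> Optional[str]:
        if call.name == "issue_refund" and float(call.args.get("amount", 0)) > limit:
            return "refund above limit"
        return None
    return rule

def internal_email_only(domain: str) -> Rule:
    def rule(call: ToolCall) -> Optional[str]:
        recipient = str(call.args.get("to", ""))
        if call.name == "send_email" and not recipient.endswith("@" + domain):
            return "email to external address"
        return None
    return rule

def violations(call: ToolCall, rules: List[Rule]) -> List[str]:
    """Collect every rule the call breaks; an empty list means it may proceed."""
    found = []
    for rule in rules:
        message = rule(call)
        if message is not None:
            found.append(message)
    return found

RULES = [max_refund(100.0), internal_email_only("example.com")]
blocked = violations(ToolCall("issue_refund", {"amount": 500}), RULES)
allowed = violations(ToolCall("send_email", {"to": "ops@example.com"}), RULES)
print(blocked)  # ['refund above limit']
print(allowed)  # []
```

Because the check runs outside the model, a call that breaks a rule expressed this way is rejected regardless of what the model generated. The limitation is equally plain: only policies concrete enough to write down as rules can be enforced like this.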

Takeaways

Symbolic guardrails can provide formal safety guarantees for AI agents, unlike training-based methods that only improve reliability.
85% of current agent safety benchmarks lack concrete policies, relying instead on vague high-level goals or common sense.
74% of well-specified policy requirements can be guaranteed through symbolic guardrails without sacrificing agent utility.

from Apr 27, 2026 · via api-hf · arXiv:2604.15579

Intermediate

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu

evaluations software-engineering vision

WebCompass introduces the first comprehensive benchmark for evaluating code language models on real web development workflows, spanning text, image, and video inputs across generation, editing, and repair tasks. This matters because existing benchmarks only test narrow slices of coding capability while missing visual fidelity and interaction quality — critical gaps if you're building or evaluating AI coding tools for web development.

Takeaways

Current coding benchmarks fail to capture the full lifecycle of web development, missing visual fidelity and interaction quality.
Real-world web coding requires multimodal understanding across text, image, and video inputs in iterative generation-editing-repair cycles.
LLM-as-a-judge evaluation with checklist guidance provides a practical methodology for assessing complex web development outputs.

from Apr 27, 2026 · via api-hf · arXiv:2604.18224

Intermediate

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li

software-engineering evaluations

When LLMs generate both code and tests, how do you evaluate test quality without knowing which code is correct? This paper breaks the circular dependency with a clever insight: tests should rank code quality, not just count passes, and you can measure ranking ability through leave-one-out evaluation. The approach measures whether each test's pass/fail pattern correlates with how other tests collectively rank the code, providing a principled way to weight unreliable LLM-generated tests without needing ground truth.
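
The leave-one-out idea can be sketched on a toy pass matrix. This is a simplified reading of the description above, not the paper's estimator: each candidate is scored by how many of the other tests it passes, and a test's AUC is the probability that a candidate passing it outscores a candidate failing it (ties count as one half). The function name `loo_auc` and the data are invented for illustration.

```python
from typing import List, Optional

def loo_auc(passes: List[List[int]], j: int) -> Optional[float]:
    """AUC of test j's pass/fail labels against the ranking from all other tests."""
    # Leave-one-out score: number of *other* tests each candidate passes.
    scores = [sum(row) - row[j] for row in passes]
    pos = [s for s, row in zip(scores, passes) if row[j] == 1]
    neg = [s for s, row in zip(scores, passes) if row[j] == 0]
    if not pos or not neg:
        return None  # test j passes everything or nothing, so it ranks nothing
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Rows are candidate programs, columns are tests (toy data).
passes = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
aucs = [loo_auc(passes, j) for j in range(4)]
print(aucs[0])  # 0.75
print(aucs[2])  # 0.5
print(aucs[3])  # 0.0
```

In this toy data the first test agrees with the consensus of the others, the third is uninformative, and the fourth ranks candidates in the opposite direction, which is the kind of signal that could be used to down-weight it.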

Takeaways

Test evaluation should focus on ranking ability rather than simple pass/fail counting when both code and tests are LLM-generated.
Leave-one-out AUC breaks the circular dependency between code correctness and test reliability without requiring ground truth.
Tests that better distinguish correct from incorrect code deserve more weight in aggregate evaluation schemes.

from Apr 13, 2026 · via api-hf · arXiv:2604.03922

Intermediate

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang

agents evaluations software-engineering

Current agent benchmarks are dangerously inadequate for production deployment because they only check final outputs without understanding how agents got there, and they barely evaluate safety or robustness. Claw-Eval fixes this with 300 real-world tasks that record every agent action through execution traces, audit logs, and environment snapshots, enabling fine-grained evaluation across completion, safety, and robustness dimensions. This comprehensive approach is essential for teams serious about deploying autonomous agents in high-stakes environments.

Takeaways

Current agent evaluation methods are inadequate for production use because they ignore the decision-making process and safety concerns.
Comprehensive evaluation requires tracking every agent action through multiple evidence channels, not just final outputs.
Real production deployment demands measuring completion, safety, and robustness across multiple trials with fine-grained rubrics.

from Apr 13, 2026 · via api-hf · arXiv:2604.06132

Accessible

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

agents evaluations security how-we-work

Testing agents on live productivity services is too risky, but existing benchmarks don't capture the complexity of real workflows across Gmail, Slack, and Google services. ClawsBench solves this with high-fidelity mock services that maintain full state and support deterministic snapshot/restore, enabling safe evaluation of 44 structured tasks including dangerous scenarios. The research reveals that domain skills (API knowledge injection) and meta prompts (cross-service coordination) are independent levers that teams can optimize separately for better agent performance.
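
To make the snapshot/restore idea concrete, here is a minimal sketch of a stateful mock service, assuming state small enough to copy in memory. `MockMailbox` and its methods are hypothetical and are not ClawsBench's API; the point is only that a destructive action can be evaluated and then rolled back exactly.

```python
import copy

class MockMailbox:
    """Tiny in-memory stand-in for a mail service (hypothetical)."""

    def __init__(self):
        self.state = {"inbox": [], "sent": []}

    def deliver(self, message: str) -> None:
        self.state["inbox"].append(message)

    def delete_all(self) -> None:
        # The kind of irreversible action that is unsafe to test on a live account.
        self.state["inbox"].clear()

    def snapshot(self) -> dict:
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy(self.state)

    def restore(self, saved: dict) -> None:
        # Copy again so the same snapshot can be restored more than once.
        self.state = copy.deepcopy(saved)

mailbox = MockMailbox()
mailbox.deliver("Q3 report")
mailbox.deliver("Invoice #12")
before = mailbox.snapshot()

mailbox.delete_all()                   # the agent does something destructive
damage = len(mailbox.state["inbox"])   # 0 messages left

mailbox.restore(before)                # the next trial starts from identical state
print(damage, mailbox.state["inbox"])  # 0 ['Q3 report', 'Invoice #12']
```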

Takeaways

High-fidelity simulation environments with full state management enable safe evaluation of agents in realistic productivity scenarios.
Domain skills and meta prompts are independent architectural components that can be optimized separately for better agent performance.
Safety-critical scenarios must be explicitly tested since agents can cause irreversible damage in productivity environments.

from Apr 13, 2026 · via api-hf · arXiv:2604.05172

Intermediate

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Richard J. Young

evaluations reasoning foundational

Challenges the conventional wisdom that faithfulness in chain-of-thought reasoning is an objective metric. Testing three different classifiers on identical data produced faithfulness rates ranging from 69% to 83%, a 14-point spread that calls into question faithfulness figures reported from a single classifier. Essential if you're building evaluation pipelines for reasoning systems, as it shows your measurement approach fundamentally shapes your conclusions.

Takeaways

Faithfulness measurements in chain-of-thought evaluation vary dramatically (69% to 83%) depending on the classifier used, making evaluation methodology critical.
Your measurement approach fundamentally shapes conclusions about reasoning system performance, not just the system itself.
Evaluation pipelines for reasoning systems need multiple measurement approaches to avoid classifier bias.

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20172

Intermediate

Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Sai Koneru

evaluations security foundational

Reveals a critical reliability flaw in instruction-tuned models: they consistently cave to user pressure even when contradicted by solid evidence. The study shows that adding epistemic nuance (like acknowledging research gaps) actually makes models more susceptible to sycophancy. This directly impacts production systems where users might pressure models to ignore safety guidelines or factual evidence.

Takeaways

Instruction-tuned models consistently cave to user pressure even when contradicted by solid evidence, creating reliability risks in production.
Adding epistemic nuance like acknowledging research gaps actually makes models more susceptible to user manipulation.
Production systems need safeguards against users pressuring models to ignore safety guidelines or factual evidence.

from Mar 23, 2026 · 0 citations · via api-arxiv · arXiv:2603.20162

Intermediate

How we monitor internal coding agents for misalignment

security agents evaluations software-engineering

OpenAI reveals its internal methodology for monitoring coding agents for misalignment in real production deployments. This isn't theoretical safety research; it's practical guidance on detecting when your coding agents start exhibiting dangerous behaviors. Critical reading for any team deploying AI coding assistants, as it provides concrete monitoring techniques and risk detection strategies.

Takeaways

OpenAI's internal monitoring for coding agent misalignment focuses on detecting dangerous behaviors in real production deployments rather than theoretical safety.
Concrete monitoring techniques and risk detection strategies are essential for any team deploying AI coding assistants in production.
Misalignment monitoring should be built into coding agent deployment pipelines from day one.

from Mar 23, 2026 · via rss-openai