Tag: software-engineering

Intermediate

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Runyuan He, Qiuyang Mang, Shang Zhou, Kaiyuan Liu, Hanchen Li, Huanzhi Mao, Qizheng Zhang, Zerui Li, Bo Peng, Lufeng Cheng, Tianfu Fu, Yichuan Wang, Wenhao Chai, Jingbo Shang, Alex Dimakis, Joseph E. Gonzalez, Alvin Cheung

This addresses a critical bottleneck in training better coding agents: the scarcity of open-ended programming problems that mirror real-world development challenges. FrontierSmith automatically evolves competitive programming problems into open-ended variants that elicit diverse solution approaches. It is essential reading for understanding how to improve AI coding capabilities beyond the current focus on well-defined tasks like bug fixes and feature implementation.

Takeaways

Open-ended coding problems are essential for training LLMs that can handle real-world development challenges.
Automated synthesis can scale creation of diverse coding problems that elicit genuinely different solution approaches.
Current LLM coding training focuses too heavily on well-defined tasks versus the ambiguous problems developers actually face.

from May 18, 2026 · via api-hf · arXiv:2605.14445

Accessible

Not so locked in any more

software-engineering how-we-work opinion

This captures a profound shift in software engineering economics—AI coding agents are eliminating traditional language and platform lock-in by making rewrites economically feasible. The example of a company using coding agents to migrate legacy iPhone/Android apps to React Native illustrates how AI changes the cost-benefit calculus of maintaining separate codebases. This has massive implications for technology choices and technical debt management.

Takeaways

AI coding agents are reducing the economic barriers to cross-platform migrations and rewrites.
Traditional platform lock-in becomes less relevant when AI can handle the tedious work of code translation.
Strategic technology decisions need to account for dramatically lower migration costs in an AI-augmented world.

from May 18, 2026 · via rss-willison

Accessible

Why senior developers fail to communicate their expertise

software-engineering how-we-work opinion

This challenges the conventional wisdom that technical expertise alone makes senior developers valuable in the AI era. The author argues that senior developers instinctively focus on technical complexity while business stakeholders worry about uncertainty—a communication gap that becomes critical when AI can handle much of the complexity but amplifies the uncertainty. If you're a senior engineer wondering how to stay relevant, this reframes the conversation entirely.

Takeaways

Senior developers must shift from communicating complexity to addressing business uncertainty in AI-augmented workflows.
Traditional technical communication patterns become counterproductive when AI handles routine complexity.
The most valuable senior developers will be those who can translate between AI capabilities and business outcomes.

from May 18, 2026 · via manual

Intermediate

HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

Dongming Jiang, Yi Li, Guanpeng Li, Qiannan Li, Bingzhe Li

agents llms foundational software-engineering

Finally, a serious approach to agent memory that goes beyond naive vector search. HAGE reconceptualizes memory retrieval as query-conditioned graph traversal, where relationships have varying strength and confidence. This matters because most production agent systems still rely on flat retrieval that ignores the complex, context-dependent nature of how information should be connected and weighted. If you're building stateful agents, this provides a blueprint for sophisticated memory architectures.

Takeaways

Agent memory should be organized as weighted multi-relational graphs rather than flat vector stores.
Query-conditioned traversal enables more sophisticated retrieval than static similarity search.
Trainable relation features allow memory systems to adapt to different types of queries and contexts.
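To make the idea of query-conditioned traversal over a weighted multi-relational graph concrete, here is a minimal sketch. It is not HAGE's implementation: the graph, the relation names, and the `traverse` function are hypothetical, and the per-query relation weights are set by hand here, whereas the entry describes them as trainable.

```python
import heapq

# Hypothetical memory graph: node -> list of (neighbor, relation, base_strength).
GRAPH = {
    "meeting_notes": [("alice", "mentions", 0.9), ("q3_budget", "about", 0.6)],
    "alice": [("alice_email", "authored", 0.8)],
    "q3_budget": [("budget_sheet", "references", 0.7)],
    "alice_email": [],
    "budget_sheet": [],
}

def traverse(graph, start, relation_weights, top_k=3):
    """Best-first traversal. A node's score is the best product of
    (edge strength * query-specific relation weight) along any path from start."""
    best = {start: 1.0}
    heap = [(-1.0, start)]  # max-heap via negated scores
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry; a better path was already found
        for neighbor, relation, strength in graph[node]:
            new_score = score * strength * relation_weights.get(relation, 0.0)
            if new_score > best.get(neighbor, 0.0):
                best[neighbor] = new_score
                heapq.heappush(heap, (-new_score, neighbor))
    ranked = sorted((n for n in best if n != start), key=lambda n: -best[n])
    return ranked[:top_k]

# Assumed, hand-set weights: the same graph is read differently per query.
people_query = {"mentions": 1.0, "authored": 1.0, "about": 0.2, "references": 0.2}
budget_query = {"mentions": 0.2, "authored": 0.2, "about": 1.0, "references": 1.0}

print(traverse(GRAPH, "meeting_notes", people_query))
print(traverse(GRAPH, "meeting_notes", budget_query))
```

The same starting node yields different retrieval orders depending on which relations the query emphasizes, which is the contrast with a static similarity lookup.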

from May 18, 2026 · 0 citations · via api-hf · arXiv:2605.09942

Intermediate

Key-Value Means

Daniel Goldstein, Eugene Cheah

foundational software-engineering

Key-Value Means offers a practical solution to the fundamental memory bottleneck in transformers without requiring custom kernels. It provides O(N) chunked processing with sublinear memory growth while maintaining the parallelizable training benefits of standard transformers. This is immediately relevant for production systems dealing with long contexts where KV-cache memory becomes the limiting factor.

Takeaways

KVM provides a unified solution combining benefits of transformers and linear RNNs without custom kernel requirements.
The approach enables continuous trade-offs between memory usage and computational complexity in production systems.
Sublinear state growth makes long-context applications economically feasible at scale.

from May 18, 2026 · via api-hf · arXiv:2605.09877

Intermediate

Harness engineering: leveraging Codex in an agent-first world

agents software-engineering how-we-work

Essential reading for anyone building agent-first development workflows. Ryan Lopopolo of OpenAI shares practical insights from Codex implementation that challenge conventional wisdom about how AI should integrate into software engineering processes. This isn't another theoretical piece; it's a practitioner's guide to harnessing AI agents in real development environments where traditional tooling falls short.

Takeaways

Agent-first workflows require fundamentally different architectural thinking than traditional AI-assisted development.
Codex integration succeeds when it becomes the primary interface rather than a secondary tool.
Production agent systems need careful harness engineering to bridge the gap between AI capabilities and developer workflows.

from May 18, 2026 · via manual

Intermediate

MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

Ruijie Zhou, Fanxu Meng, Yufei Xu, Tongxuan Liu, Guangming Lu, Muhan Zhang, Wenjie Pei

llms software-engineering

MISA is a drop-in optimization for sparse attention that cuts computational costs on long contexts by treating attention heads as a mixture of experts. It uses cheap block-level statistics to route queries to only a few relevant heads instead of scoring every token with every head. This is immediately practical for production systems dealing with long-context inference, offering significant speedups while preserving the expressiveness of the original attention mechanism.

Takeaways

Sparse attention indexing costs can be dramatically reduced using mixture-of-experts routing.
Block-level statistics provide sufficient information for efficient head selection.
The optimization preserves attention quality while offering substantial computational savings.

from May 11, 2026 · via api-hf · arXiv:2605.07363

Accessible

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

software-engineering evaluations agents how-we-work

The first comprehensive evaluation framework for AI coding platforms that treats them as virtual software agencies rather than just code generators. The 68-metric evaluation across product management, engineering, and operations reveals four critical shortcomings in current platforms: specification bottlenecks, architectural blind spots, iteration fragility, and business readiness gaps—essential insights for anyone building or evaluating AI development tools.

Takeaways

AI coding platforms need evaluation beyond code quality to include product management and operations capabilities.
Current platforms struggle with specification understanding, architectural decisions, and iterative development.
Business readiness requires capabilities spanning multiple roles, not just engineering output.

from May 11, 2026 · via api-hf · arXiv:2605.04637

Accessible

James Shore: You Need AI That Reduces Maintenance Costs

software-engineering

James Shore argues that the real value of AI tools lies not in initial development speed but in reducing long-term maintenance costs—the largest expense in most software projects. This challenges the common focus on AI coding assistants for feature development and suggests we should evaluate AI tools based on whether they create more maintainable, debuggable, and extensible code.

Takeaways

AI's value should be measured by maintenance cost reduction, not development speed.
Focus on whether AI tools create more maintainable code rather than faster initial development.
Long-term code quality matters more than short-term productivity gains.

from May 11, 2026 · via manual

Accessible

Appearing Productive in The Workplace — No One

how-we-work opinion software-engineering

This challenges the conventional wisdom that AI-generated code is obviously detectable by experienced engineers. The author argues that AI can now produce work that passes expert review while containing fundamental flaws that only surface later in production, creating two dangerous failure modes: code that looks professional but lacks deep understanding, and teams that become dependent on AI output they can't properly evaluate.

Takeaways

AI-generated work can fool experienced reviewers by appearing expert without actually being expert.
The failure modes are both immediate (bad code getting through) and systemic (teams losing evaluation skills).
Traditional code review processes may be insufficient for AI-assisted development.

from May 11, 2026 · via manual

Intermediate

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš, Iryna Gurevych

llms evaluations software-engineering

Challenges the narrow focus on functional correctness in code generation by developing multilingual reward models that score across multiple criteria like readability, efficiency, and security. This work is crucial for teams building production code generation systems, as it provides both evaluation benchmarks and training data for more holistic code quality assessment.

Takeaways

Current code reward models are overly focused on functional correctness while neglecting other critical quality dimensions.
Multilingual, multi-criteria evaluation reveals significant gaps in existing code generation assessment approaches.
The Themis dataset and benchmark provide practical tools for training and evaluating more comprehensive code reward models.

from May 4, 2026 · via api-hf · arXiv:2605.00754

Intermediate

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan

software-engineering evaluations how-we-work foundational

This research reframes LLM data engineering by mapping the machine learning lifecycle directly onto software development practices: training data is treated as source code, model training as compilation, and failures as bugs to debug. For teams struggling with opaque training processes and data quality issues, this framework offers a systematic approach to diagnosing and fixing model deficiencies at the data level.

Takeaways

Training data can be treated as source code with structured representations enabling systematic debugging of model failures.
The ML development lifecycle maps precisely onto software engineering practices when proper abstractions are established.
Concept-level gaps in training data become debuggable when models fail on domain-specific tasks.

from May 4, 2026 · via api-hf · arXiv:2604.24819

Accessible

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

agents software-engineering evaluations how-we-work

A remarkable real-world case study of autonomous LLM agents managing actual financial capital over 21 days, generating 7.5M invocations and $20M in trading volume with 99.9% settlement success. This paper provides invaluable insights into building reliable production agent systems, showing that reliability emerges from the operating layer architecture rather than the base model alone.

Takeaways

Reliability in production AI agents comes from systematic operating layer controls, not just model capabilities.
Real capital deployment reveals failure modes and reliability patterns invisible in simulation environments.
Large-scale agent deployments require careful attention to validation, state management, and settlement infrastructure.

from May 4, 2026 · via api-hf · arXiv:2604.26091

Intermediate

The Last Harness You'll Ever Build

Haebin Seong, Li Yin, Haoran Zhang

agents software-engineering how-we-work

Presents an evolutionary framework that automates the painful process of building agent harnesses for new domains, using adversarial evaluation and iterative refinement to optimize prompts, tools, and orchestration logic. This directly tackles one of the biggest bottlenecks in production AI systems—the manual engineering required to make foundation models effective for specific enterprise workflows.

Takeaways

Agent harness engineering can be automated through evolutionary optimization with adversarial evaluation feedback.
The meta-evolution loop concept enables systems to improve their own optimization processes over time.
Automated harness creation could dramatically reduce the engineering overhead of deploying agents in new domains.

from May 4, 2026 · via api-hf · arXiv:2604.21003

Intermediate

The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Chenyu You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

foundational agents software-engineering opinion

Proposes a radical reimagining of research artifacts as machine-executable packages that preserve the full exploration process, including failures and implementation details that traditional papers discard. For teams building AI agents that need to understand and extend existing work, this framework offers a path toward truly reproducible and agent-consumable research.

Takeaways

Traditional research papers impose storytelling and engineering taxes that make them unsuitable for AI agents to consume and extend.
Agent-native artifacts should preserve the full exploration graph including failed experiments and rejected hypotheses.
Machine-executable research packages can bridge the gap between human-readable findings and agent-actionable specifications.

from May 4, 2026 · via api-hf · arXiv:2604.24658

Intermediate

Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets

Harshit Joshi, Priyank Shethia, Jadelynn Dao, Monica S. Lam

rag reasoning llms software-engineering

SLIDERS challenges the conventional chunk-and-aggregate approach to document QA by extracting information into a relational database and reasoning with SQL instead of concatenated text. This architectural approach sidesteps the fundamental limitation that any fixed context window will eventually be exceeded, making it essential reading for engineers building document analysis systems that need to scale beyond typical RAG limitations.

Takeaways

Traditional chunk-and-aggregate approaches hit an aggregation bottleneck as document collections grow, even with infinite context windows.
Extracting information into structured databases and reasoning with SQL scales better than reasoning over concatenated text.
Data reconciliation using provenance and extraction rationales is crucial for maintaining coherence in locally extracted information.
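As a rough illustration of reasoning with SQL over extracted facts rather than over concatenated text, the sketch below loads a few rows into an in-memory SQLite database and answers an aggregate question with a query. The table schema, the sample rows, and the `total_revenue_by_company` helper are all hypothetical; the entry does not describe SLIDERS' actual schema or extraction step.

```python
import sqlite3

# Hypothetical facts that an extraction step might pull from separate documents.
# The last column records provenance (which document each fact came from).
extracted = [
    ("Acme", 2023, 12.5, "report_a.pdf"),
    ("Acme", 2024, 14.0, "report_b.pdf"),
    ("Globex", 2024, 9.0, "report_c.pdf"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revenue (company TEXT, year INTEGER, amount_musd REAL, source TEXT)"
)
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?)", extracted)

def total_revenue_by_company(connection):
    """Aggregate across documents in SQL, keeping the sources for each answer."""
    query = """
        SELECT company, SUM(amount_musd), GROUP_CONCAT(source, ', ')
        FROM revenue
        GROUP BY company
        ORDER BY company
    """
    return connection.execute(query).fetchall()

for row in total_revenue_by_company(conn):
    print(row)
```

The aggregation cost here depends on the number of extracted rows, not on how much raw text has to fit into a single prompt.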

from Apr 27, 2026 · via api-hf · arXiv:2604.22294

Intermediate

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

Juyong Jiang, Chenglin Cai, Chansung Park, Jiasi Shen, Sunghun Kim, Jianguo Li, Yue Wang

llms software-engineering evaluations

WebGen-R1 tackles the challenge of training smaller LLMs to generate full websites using reinforcement learning, addressing the token costs and latency issues of current agentic approaches that rely on expensive multi-turn execution with proprietary models. The key innovation is designing reliable rewards for inherently subjective tasks like aesthetic evaluation and cross-page functionality, making end-to-end training feasible for complex code generation.

Takeaways

End-to-end RL training offers a promising alternative to expensive multi-turn agentic frameworks for complex code generation tasks.
The main bottleneck in training LLMs for website generation is designing reliable rewards for subjective qualities like aesthetics and functionality.
Scaffold-driven structured generation provides a framework for training smaller models to handle multi-file, project-level coding tasks.

from Apr 27, 2026 · via api-hf · arXiv:2604.20398

Intermediate

AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang

agents software-engineering

AgentSPEX introduces a declarative language for specifying LLM agent workflows with explicit control flow, addressing the maintainability nightmare of workflow logic tightly coupled to Python code in current frameworks like LangGraph and CrewAI. This matters because reactive prompting makes agent behavior unpredictable, while existing orchestration frameworks create maintenance headaches as workflows grow complex.

Takeaways

Current agent frameworks tightly couple workflow logic with Python code, making agents difficult to maintain as they grow complex.
Explicit control flow with typed steps, branching, and state management provides better structure than reactive prompting approaches.
Separating workflow specification from execution environment enables better tooling, verification, and collaborative development of agent systems.

from Apr 27, 2026 · via api-hf · arXiv:2604.13346

Accessible

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo

agents software-engineering how-we-work evaluations

SWE-chat provides the first large-scale empirical evidence of how developers actually use AI coding agents in the wild, revealing that usage patterns are bimodal and agents are surprisingly inefficient. The dataset shows that only 44% of agent-produced code makes it into user commits, challenging the narrative of coding agent effectiveness and providing crucial insights for anyone building or deploying these tools in production.

Takeaways

Real-world coding patterns are bimodal: 41% of sessions involve agents writing virtually all code, while 23% have humans writing everything themselves.
Despite improving capabilities, only 44% of agent-produced code survives into user commits, revealing significant inefficiency in natural settings.
The first large-scale dataset of real coding agent usage provides empirical evidence that challenges assumptions about agent effectiveness in production.

from Apr 27, 2026 · via api-hf · arXiv:2604.20779

Intermediate

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Xinping Lei, Xinyu Che, Junqi Xiong, Chenchen Zhang, Yukai Huang, Chenyu Zhou, Haoyang Huang, Minghao Liu, Letian Zhu, Hongyi Ye, Jinhua Hao, Ken Deng, Zizheng Zhan, Han Li, Dailin Li, Yifan Yao, Ming Sun, Zhaoxiang Zhang, Jiaheng Liu

evaluations software-engineering vision

WebCompass introduces the first comprehensive benchmark for evaluating code language models on real web development workflows, spanning text, image, and video inputs across generation, editing, and repair tasks. This matters because existing benchmarks only test narrow slices of coding capability while missing visual fidelity and interaction quality — critical gaps if you're building or evaluating AI coding tools for web development.

Takeaways

Current coding benchmarks fail to capture the full lifecycle of web development, missing visual fidelity and interaction quality.
Real-world web coding requires multimodal understanding across text, image, and video inputs in iterative generation-editing-repair cycles.
LLM-as-a-judge evaluation with checklist guidance provides a practical methodology for assessing complex web development outputs.

from Apr 27, 2026 · via api-hf · arXiv:2604.18224

Intermediate

Quo Vadis, Code Review? Exploring the Future of Code Review

software-engineering how-we-work

A survey of 100 developers across five companies reveals how AI automation is reshaping code review practices while the fundamentals remain essential. The research shows that practitioners expect code review to stay critical but anticipate significant changes in what gets reviewed and how much time it takes. This matters because understanding these trends helps teams adapt their review processes and tooling investments as AI-assisted development becomes mainstream.

Takeaways

Developers expect code review to remain essential despite increasing AI automation in development workflows.
The scope and time investment in code review are expected to shift significantly over the next five years as AI tools mature.
Teams need to proactively adapt review processes and tooling strategies to work effectively with AI-assisted development.

from Apr 27, 2026 · via manual · arXiv:2508.06879

Intermediate

The AI engineering stack we built internally — on the platform we ship

software-engineering how-we-work llms

Cloudflare shares real metrics from running their own AI engineering stack in production, processing 241 billion tokens and serving 3,683 internal users. This is essential reading if you're building AI infrastructure — they dogfood their own products (AI Gateway, Workers AI) and provide actual numbers on throughput, costs, and architectural decisions. The post challenges the common wisdom of building separate dev/prod AI stacks by showing how running on your own platform reveals critical performance and scalability insights.

Takeaways

Running AI infrastructure on the same platform you ship reveals hidden performance bottlenecks and helps prioritize product improvements.
Processing 241 billion tokens across 20 million requests provides concrete scale benchmarks for AI Gateway architecture decisions.
Dogfooding AI products with thousands of internal users uncovers real-world usage patterns that synthetic benchmarks miss.

from Apr 27, 2026 · via manual

Intermediate

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li

agents software-engineering how-we-work

TREX automates the entire LLM fine-tuning pipeline through multi-agent collaboration, from literature research to data preparation to model evaluation. This challenges the current reality where fine-tuning requires extensive manual orchestration by ML engineers, offering a glimpse into fully automated ML workflows that could democratize model customization for domain-specific applications.

Takeaways

Multi-agent systems can automate complex ML workflows beyond individual tasks, handling entire fine-tuning lifecycles.
Modeling the experimental process as a search tree enables efficient exploration and reuse of historical training results.
Automated fine-tuning could significantly reduce the expertise barrier for domain-specific LLM customization.

from Apr 20, 2026 · via api-hf · arXiv:2604.14116

Intermediate

Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

rag agents software-engineering

Corpus2Skill fundamentally reimagines RAG by giving AI agents a navigable map of your knowledge base instead of treating them as passive consumers of search results. Rather than hoping retrieval finds the right documents, agents can see the corpus structure, drill down through hierarchical summaries, and strategically combine evidence across different branches—solving the core limitation that RAG systems can't reason about what they haven't seen.

Takeaways

Traditional RAG limits AI agents to passive consumption of search results without visibility into corpus structure or unexplored areas.
Hierarchical skill directories enable agents to navigate knowledge strategically and combine evidence across different topic branches.
Offline corpus compilation into navigable structures provides better performance than runtime retrieval-only approaches.

from Apr 20, 2026 · via api-hf · arXiv:2604.14572

Intermediate

AEGIS: Anchor-Enforced Gradient Isolation for Knowledge-Preserving Vision-Language-Action Fine-Tuning

Guransh Singh

llms vision software-engineering

AEGIS solves the critical problem of fine-tuning vision-language models for robotics without destroying their original capabilities. Current approaches either throw away valuable continuous supervision or use LoRA adapters that still overwrite pre-trained knowledge, but AEGIS uses orthogonal gradient projection to enable direct continuous learning while preserving the model's existing visual-question-answering abilities.

Takeaways

Fine-tuning VLMs for robotics typically destroys original capabilities due to gradient asymmetry between continuous control and discrete language training.
Orthogonal gradient projection enables continuous learning while preserving pre-trained manifolds better than LoRA or stop-gradient approaches.
The framework addresses the spectral mismatch between low-rank regression gradients and high-dimensional semantic representations.
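For readers unfamiliar with orthogonal gradient projection, the generic operation is to subtract from a gradient its component along a direction that should be left untouched. The sketch below shows only that textbook projection on plain Python lists. How AEGIS chooses its anchor directions is not described in this entry, so `anchor_dir` and the sample vectors are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_out(grad, anchor):
    """Return grad with its component along anchor removed:
    grad - (grad . anchor / anchor . anchor) * anchor."""
    denom = dot(anchor, anchor)
    if denom == 0.0:
        return list(grad)  # nothing to protect when the anchor is the zero vector
    scale = dot(grad, anchor) / denom
    return [g - scale * a for g, a in zip(grad, anchor)]

# Hypothetical values: a task gradient and a direction to preserve.
task_grad = [3.0, 4.0, 0.0]
anchor_dir = [1.0, 0.0, 0.0]

safe_grad = project_out(task_grad, anchor_dir)
print(safe_grad)                   # update with the protected component removed
print(dot(safe_grad, anchor_dir))  # zero means the update is orthogonal to the anchor
```

An update along `safe_grad` does not move the parameters along the protected direction, which is the intuition behind preserving existing capabilities while still learning the new task.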

from Apr 20, 2026 · via api-arxiv · arXiv:2604.16067

Accessible

Steve Yegge

how-we-work agents software-engineering opinion

Yegge's conversation reveals that even Google's engineering teams follow the same AI adoption pattern as traditional companies: 20% power users building with agents, 20% refusing AI tools entirely, and 60% stuck using basic chat interfaces like Cursor. This insight challenges assumptions about tech giants being ahead on internal AI adoption and suggests most organizations are at similar maturity levels regardless of their AI product offerings.

Takeaways

Google's internal AI adoption mirrors traditional companies despite their advanced AI research and products.
The industry-wide pattern shows 60% of engineers still using basic chat tools rather than advanced agentic workflows.
Having cutting-edge AI products doesn't necessarily translate to advanced internal adoption within engineering teams.

from Apr 20, 2026 · 0 citations · via rss-willison

Accessible

The Claude Coding Vibes Are Getting Worse

llms software-engineering opinion

A practitioner's firsthand account of Claude's coding capabilities deteriorating over recent months, with Opus 4.7 marking a particularly noticeable decline in code quality and user experience. This represents the kind of model drift that production teams using AI coding assistants need to monitor and plan for, as capabilities can regress without warning across model updates.

Takeaways

AI coding assistant capabilities can degrade over time through model updates, requiring continuous monitoring in production environments.
Recent Claude releases show noticeable declines in coding quality, according to the author's firsthand experience.
Teams should plan for potential capability regressions when building dependencies on AI coding tools.

from Apr 20, 2026 · via manual

Intermediate

Design and code inspections to reduce errors in program development

M. E. Fagan

software-engineering foundational

This seminal 1976 IBM paper established formal code inspection processes that remain surprisingly relevant in the AI-assisted development era. As teams increasingly rely on AI-generated code, the systematic verification processes and error categorization methods described here become even more critical for maintaining code quality and catching subtle bugs that AI tools might miss or introduce.

Takeaways

Formal inspection processes with defined participant roles can substantially improve programming quality and productivity.
Systematic error categorization and measurement enable continuous process improvement and steadily falling error rates.
The inspection methodology provides a framework for quality control that remains relevant for AI-generated code verification.

from Apr 20, 2026 · via manual

Accessible

Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

Huacan Wang, Jie Zhou, Ningyan Zhu, Shuo Zhang, Feiyu Chen, Jiarou Wu, Ge Chen, Chen Liu, Wangyi Chen, Xiaofeng Mou, Yi Xu

software-engineering agents how-we-work

Sema Code tackles the enterprise reality that every AI coding solution locks you into their specific interface, making it impossible to reuse AI capabilities across different development environments. Their embeddable architecture decouples the AI reasoning engine from delivery mechanisms, letting teams integrate the same AI coding capabilities into CLIs, IDEs, web apps, or custom toolchains without rebuilding from scratch.

Takeaways

Current AI coding solutions create vendor lock-in by coupling reasoning capabilities with specific delivery interfaces.
Decoupling the AI engine into a standalone library enables reuse across heterogeneous engineering environments.
The framework addresses enterprise needs like multi-tenancy, session management, and permission control that are missing from consumer AI coding tools.

from Apr 20, 2026 · via api-hf · arXiv:2604.11045

Intermediate

SkVM: Compiling Skills for Efficient Execution Everywhere

Le Chen, Erhu Feng, Yubin Xia, Haibo Chen

agents software-engineering foundational

SkVM addresses the critical problem that AI agent "skills" behave inconsistently across different platforms because they're treated as raw prompts rather than compiled code. By applying traditional compiler techniques to LLM skills—measuring model capabilities, performing capability-based compilation, and enabling runtime optimization—this system makes agent skills truly portable and efficient across different model-harness combinations.

Takeaways

Treating AI agent skills as compilable code rather than raw prompts enables consistent behavior across different platforms.
Capability profiling of model-harness pairs allows for targeted compilation and optimization of skill execution.
JIT compilation and adaptive recompilation techniques can significantly improve agent skill performance at runtime.

from Apr 20, 2026 · via api-hf · arXiv:2604.03088

Intermediate

Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review — Ryan Lopopolo, OpenAI Frontier & Symphony

software-engineering llms how-we-work

Move over, prompt engineering: harness engineering is the new frontier for building production LLM systems at massive scale. This deep dive from OpenAI's Ryan Lopopolo reveals how teams operating at token-billionaire scale (1B tokens/day) architect a codebase of roughly a million lines with no human-written code and no human review. The focus shifts from optimizing individual prompts to engineering the entire infrastructure that channels LLM capabilities into reliable, scalable production systems.

Takeaways

At massive scale, engineering the infrastructure around LLMs matters more than optimizing individual prompts.
Production systems generating millions of lines of code daily require fundamentally different architectural approaches.
Token billionaire scale operations demand new engineering disciplines focused on harness systems rather than model tuning.

from Apr 13, 2026 · via rss-latentspace

Intermediate

ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li

software-engineering evaluations

When LLMs generate both code and tests, how do you evaluate test quality without knowing which code is correct? This paper breaks the circular dependency with a clever insight: tests should rank code quality, not just count passes, and you can measure ranking ability through leave-one-out evaluation. The approach measures whether each test's pass/fail pattern correlates with how other tests collectively rank the code, providing a principled way to weight unreliable LLM-generated tests without needing ground truth.

Takeaways

Test evaluation should focus on ranking ability rather than simple pass/fail counting when both code and tests are LLM-generated.
Leave-one-out AUC breaks the circular dependency between code correctness and test reliability without requiring ground truth.
Tests that better distinguish correct from incorrect code deserve more weight in aggregate evaluation schemes.
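
The leave-one-out idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names (`auc`, `leave_one_out_auc`), the pass matrix, and the choice to score each program by how many of the other tests it passes are all hypothetical.

```python
from typing import List


def auc(scores: List[float], labels: List[bool]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return 0.5  # the test splits nothing, so treat it as uninformative
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_auc(passes: List[List[bool]]) -> List[float]:
    """passes[i][j] is True when candidate program i passes test j."""
    n_tests = len(passes[0])
    result = []
    for j in range(n_tests):
        # Score each program using only the OTHER tests (assumed: pass count).
        others = [sum(row) - row[j] for row in passes]
        labels = [row[j] for row in passes]
        result.append(auc(others, labels))
    return result


# Hypothetical data: tests 0-2 agree with each other, test 3 is inverted.
passes = [
    [True, True, True, False],
    [True, True, True, False],
    [False, False, False, True],
    [False, False, False, True],
]
print(leave_one_out_auc(passes))  # [1.0, 1.0, 1.0, 0.0]
```

A test whose score sits near 1.0 agrees with the ranking implied by the other tests, so it could be given more weight; a score near 0.0 flags a test that contradicts them.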

from Apr 13, 2026 · via api-hf · arXiv:2604.03922

Intermediate

Neural Computers

Mingchen Zhuge, Changsheng Zhao, Haozhe Liu, Zijian Zhou, Shuming Liu, Wenyi Wang, Ernie Chang, Gael Le Lan, Junjie Fei, Wenxuan Zhang, Yasheng Sun, Zhipeng Cai, Zechun Liu, Yunyang Xiong, Yining Yang, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber

foundational agents software-engineering

This proposes a radical paradigm shift where models don't just generate code or control external systems—they become the execution environment itself, unifying computation, memory, and I/O in learned runtime state. Neural Computers learn to execute programs by watching I/O traces and can potentially be reprogrammed through natural language rather than traditional coding. While early-stage, this vision could fundamentally reshape how we build AI systems by eliminating the boundary between model and runtime environment.

Takeaways

Neural Computers eliminate the distinction between model and execution environment by making the model itself the running computer.
Early implementations can learn interface primitives and basic execution patterns from I/O traces alone.
This paradigm shift could enable natural language reprogramming of computational systems without traditional coding interfaces.

from Apr 13, 2026 · via api-hf · arXiv:2604.06425

Intermediate

Self-Execution Simulation Improves Coding Models

Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

llms software-engineering reasoning foundational

Code LLMs struggle because they can't accurately predict what their generated code will do when executed, leading to logical errors that escape syntax checking. This research trains models to simulate program execution step-by-step, enabling self-verification and iterative debugging of their own code. The approach combines supervised learning on execution traces with reinforcement learning, achieving significant improvements on competitive programming benchmarks and providing a foundation for more reliable AI coding assistants.

Takeaways

Teaching models to simulate execution enables self-verification and iterative debugging of generated code.
Combining execution simulation training with reinforcement learning significantly improves competitive programming performance.
Step-by-step execution traces provide grounding that helps models understand and debug their logical reasoning in code.

from Apr 13, 2026 · via api-hf · arXiv:2604.03253

Intermediate

Embarrassingly Simple Self-Distillation Improves Code Generation

llms software-engineering foundational

This challenges the conventional wisdom that you need external verification or teacher models to improve code generation—instead, models can learn from their own outputs using simple self-distillation. The technique improved a 30B model's performance from 42% to 55% on challenging coding problems by sampling solutions at specific temperatures and fine-tuning on them. The key insight is that this reshapes how models balance precision versus exploration in a context-dependent way, making it a practical post-training technique for enhancing coding assistants.

Takeaways

Models can significantly improve at code generation using only their own outputs, without external verification or teacher models.
Simple self-distillation resolves the precision-exploration conflict by context-dependently reshaping token distributions.
The technique shows consistent gains across model sizes and families, making it broadly applicable for improving coding assistants.

from Apr 13, 2026 · via manual

Intermediate

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, Qi Liu, Zhifang Sui, Tong Yang

agents evaluations software-engineering

Current agent benchmarks are dangerously inadequate for production deployment because they only check final outputs without understanding how agents got there, and they barely evaluate safety or robustness. Claw-Eval fixes this with 300 real-world tasks that record every agent action through execution traces, audit logs, and environment snapshots, enabling fine-grained evaluation across completion, safety, and robustness dimensions. This comprehensive approach is essential for teams serious about deploying autonomous agents in high-stakes environments.

Takeaways

Current agent evaluation methods are inadequate for production use because they ignore the decision-making process and safety concerns.
Comprehensive evaluation requires tracking every agent action through multiple evidence channels, not just final outputs.
Real production deployment demands measuring completion, safety, and robustness across multiple trials with fine-grained rubrics.

from Apr 13, 2026 · via api-hf · arXiv:2604.06132

Intermediate

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

Devakh Rashie, Veda Rashi

security agents software-engineering

Financial services face an existential problem: probabilistic LLMs operating in domains requiring absolute compliance guarantees, and existing guardrails are fundamentally inadequate for complex regulatory constraints. This paper presents a breakthrough using Lean 4 theorem proving to treat every AI action as a mathematical conjecture—execution only proceeds if the system can formally prove regulatory compliance. While the approach targets financial services, the formal verification framework could revolutionize how we build deterministic guardrails for any high-stakes AI system.

Takeaways

Probabilistic guardrails are fundamentally inadequate for regulated industries that demand mathematical certainty of compliance.
Formal theorem proving can provide deterministic guarantees by treating every AI action as a provable mathematical conjecture.
Auto-formalizing policies into verifiable code bridges the gap between human regulations and machine-enforceable constraints.

from Apr 13, 2026 · 0 citations · via api-hf · arXiv:2604.01483

Accessible

From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI

software-engineering how-we-work opinion foundational

As teams increasingly rely on AI to accelerate development, this framework warns that we're accumulating dangerous new forms of debt beyond just technical debt. Cognitive debt occurs when teams lose shared understanding of their systems as AI generates code faster than they can comprehend it, while intent debt refers to the missing documentation of why decisions were made, critical context that both humans and AI agents need to safely evolve code. This triple debt model provides an essential lens for evaluating software health in the AI era.

Takeaways

Cognitive debt erodes team understanding as AI generates code faster than teams can internalize it, creating dangerous knowledge gaps.
Intent debt—missing rationale and constraints—becomes critical when AI agents need explicit context to safely modify code.
Traditional technical debt metrics miss these human and knowledge-based risks that dominate in AI-assisted development.

from Apr 13, 2026 · via manual

Intermediate

Components of A Coding Agent

agents software-engineering llms

Essential reading if you're architecting coding agents for production use. This breaks down the core components that make LLMs effective at code generation: sophisticated tool integration, persistent memory systems that maintain context across interactions, and repository-aware context management that helps models understand large codebases. The practical focus on how these pieces work together makes this invaluable for teams moving beyond simple code completion to full coding assistance.

Takeaways

Effective coding agents require sophisticated tool integration beyond simple code completion.
Memory systems that persist context across sessions are crucial for maintaining coherent development workflows.
Repository-aware context management enables agents to understand and work with large, complex codebases.

from Apr 13, 2026 · via manual

Accessible

Ask HN: Client took over development by vibe coding. What to do?

piscator

software-engineering how-we-work opinion

A developer's experience with a client who embraced "vibe coding" with Claude Code, making rapid changes without proper planning or architecture consideration. This highlights the tension between AI-enabled development speed and traditional software engineering discipline, raising important questions about maintaining code quality and project management when AI makes coding feel effortless.

Takeaways

AI coding tools can enable rapid development that bypasses important planning and architecture phases.
"Vibe coding" with AI can create technical debt and project management challenges despite apparent productivity gains.
Professional development workflows need to adapt to balance AI speed with engineering discipline.

from Apr 6, 2026 · 61 points on HN · via api-hn

Accessible

Tell HN: Anthropic no longer allowing Claude Code subscriptions to use OpenClaw

firloop

llms software-engineering how-we-work

Anthropic's policy change affecting third-party tools like OpenClaw represents a significant shift in how developers can access Claude's capabilities outside official interfaces. This impacts teams that have built workflows around unofficial Claude integrations and highlights the business risks of depending on third-party API access patterns. Important for understanding the evolving landscape of AI tool accessibility.

Takeaways

Third-party Claude integrations now require separate pay-as-you-go billing beyond subscription limits.
Teams using unofficial Claude tools need to evaluate cost implications and migration strategies.
The change reflects tightening control over AI model access as these tools become more strategically important.

from Apr 6, 2026 · 1079 points on HN · via api-hn

Intermediate

Show HN: Gemma Gem – AI model embedded in a browser – no API keys, no cloud

ikessler

agents llms software-engineering open-source

This Chrome extension demonstrates practical browser-based AI deployment by embedding Google's Gemma 4 model locally via WebGPU, complete with webpage interaction capabilities like clicking, typing, and JavaScript execution. It proves that sophisticated AI agents can run entirely client-side without API dependencies, opening new possibilities for privacy-preserving AI tools. The implementation shows how to build truly local AI agents with real-world utility.

Takeaways

WebGPU enables running 2B-parameter models entirely in the browser without cloud dependencies.
Local AI agents can interact with web pages through tool calling while preserving user privacy.
Browser-based AI deployment eliminates API costs and latency while maintaining reasonable functionality.

from Apr 6, 2026 · 100 points on HN · via api-hn

Intermediate

Eight years of wanting, three months of building with AI

agents software-engineering how-we-work foundational

A compelling case study of how AI agents transformed an eight-year software vision into reality in just three months, specifically building comprehensive SQLite development tools. The author provides detailed insights into agentic engineering workflows and how AI can tackle complex, long-deferred projects that seemed too daunting for traditional development approaches. This demonstrates the paradigm shift from AI as a coding assistant to AI as a capable engineering partner.

Takeaways

AI agents can make previously intractable personal projects suddenly feasible by handling complex implementation details.
Agentic engineering workflows enable rapid prototyping of sophisticated developer tools that would take months using traditional methods.
The key to successful AI-assisted development is clearly defining goals while letting agents handle implementation complexity.

from Apr 6, 2026 · via rss-willison

Intermediate

Can JavaScript Escape a CSP Meta Tag Inside an Iframe?

security software-engineering

Practical security research motivated by building Claude Artifacts-style features, investigating whether Content Security Policy meta tags can effectively sandbox JavaScript in iframes without requiring separate domains. The findings show that CSP meta tags injected at the top of iframe content remain effective even against subsequent JavaScript manipulation attempts. Directly actionable for engineers building AI applications that execute user-generated or AI-generated code.

Takeaways

CSP meta tags in iframe content provide effective sandboxing without requiring separate domains for hosting.
JavaScript cannot manipulate CSP restrictions that were set via meta tags earlier in the document.
This technique enables safer execution of AI-generated code in web applications.
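
A rough sketch of the pattern, assuming a server that wraps untrusted markup before sending it to the browser. The policy string, the `sandbox="allow-scripts"` attribute, and the function name `build_sandboxed_iframe` are illustrative assumptions, not the configuration from the original post.

```python
import html

# Assumed policy: block all network loads, allow only inline script and style.
CSP = "default-src 'none'; script-src 'unsafe-inline'; style-src 'unsafe-inline'"


def build_sandboxed_iframe(untrusted_html: str, csp: str = CSP) -> str:
    """Return an <iframe> whose srcdoc starts with a CSP meta tag."""
    # The meta tag comes first so the policy applies before any untrusted markup.
    meta = (
        '<meta http-equiv="Content-Security-Policy" '
        f'content="{html.escape(csp, quote=True)}">'
    )
    document = meta + untrusted_html
    # Escape the whole document again because it lives inside an attribute.
    return (
        '<iframe sandbox="allow-scripts" '
        f'srcdoc="{html.escape(document, quote=True)}"></iframe>'
    )


print(build_sandboxed_iframe("<script>document.title = 'hi'</script>"))
```

The `sandbox` attribute without `allow-same-origin` gives the frame an opaque origin, which is a separate layer from the CSP meta tag; whether both are needed depends on the application.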

from Apr 6, 2026 · via rss-willison

Accessible

Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

software-engineering how-we-work llms

This research challenges the assumption that AI coding tools work equally well on all codebases by showing that existing code quality metrics predict how reliably LLMs can refactor code without breaking it. Teams can use metrics like CodeHealth to identify where AI assistance is safer to deploy and where human oversight is critical. Essential reading for engineering leaders planning AI tool rollouts — it turns out investing in code maintainability isn't just about helping humans, it's about preparing your codebase for AI.

Takeaways

Human-friendly code quality metrics like CodeHealth strongly correlate with AI refactoring success rates.
Teams can proactively identify high-risk areas for AI intervention using existing code quality tools.
Investing in code maintainability pays dividends for both human developers and AI tooling effectiveness.

from Apr 6, 2026 · via manual

Accessible

Falling For Claude

llms software-engineering how-we-work

A candid reflection on how always-available AI coding assistants like Claude can blur work-life boundaries in unexpected ways. The author explores the psychological and practical implications of having a tireless coding companion that makes it tempting to work at all hours. Important perspective for engineers and managers thinking about sustainable AI adoption practices.

Takeaways

AI coding assistants can create unhealthy work patterns by making development feel frictionless at any time.
The always-available nature of AI tools requires intentional boundaries to maintain work-life balance.

from Apr 6, 2026 · via manual

Intermediate

We Rewrote JSONata with AI in a Day, Saved $500K/Year

software-engineering how-we-work llms

A compelling case study of 'vibe porting' — using AI to rewrite JSONata in Go guided by the existing test suite, achieving significant cost savings in just 7 hours and $400 of API costs. This demonstrates a practical methodology for AI-assisted rewrites: leverage comprehensive tests as guardrails and let AI handle the mechanical translation work.

Takeaways

Comprehensive test suites enable reliable AI-powered porting between languages with minimal human oversight.
Vibe porting can deliver substantial business value ($500K annual savings) when applied to performance-critical components.
The methodology scales: 7 hours of AI-assisted development replaced what would have been months of manual rewriting.

from Mar 29, 2026 · via rss-willison

Accessible

If you don't opt out by Apr 24 GitHub will train on your private repos

vmg12

security software-engineering how-we-work

GitHub is automatically opting users into training Copilot on private repositories unless they explicitly opt out by April 24th — a significant policy change that could expose proprietary code to AI training. This represents a major shift in how code hosting platforms treat private repositories and requires immediate action from teams concerned about code privacy.

Takeaways

GitHub's default opt-in policy for private repo training changes the privacy expectations for enterprise code.
Teams need to audit their GitHub settings immediately to prevent proprietary code from entering AI training datasets.

from Mar 29, 2026 · 719 points on HN · via api-hn

Intermediate

Thoughts on slowing the fuck down

agents software-engineering opinion how-we-work

The creator of the Pi agent framework delivers a sharp critique of current AI-assisted development practices, arguing that the rush to generate code quickly is eroding engineering discipline and creating unsustainable technical debt. His core thesis: agent mistakes accumulate faster than human mistakes, making the 'move fast' approach particularly dangerous in AI-assisted development.

Takeaways

AI agents can generate technical debt faster than human developers, requiring new approaches to code quality control.
The velocity benefits of AI coding tools may come at the cost of long-term code maintainability and team understanding.
Engineering teams need intentional practices to maintain discipline when AI makes rapid development so tempting.

from Mar 29, 2026 · via rss-willison

Intermediate

Show HN: Robust LLM extractor for websites in TypeScript

andrew_zhong

software-engineering how-we-work rag

A practical TypeScript library that solves the common problem of extracting structured data from websites using LLMs, addressing real pain points like HTML noise, token budget management, and brittleness of traditional CSS selectors. This represents the kind of focused tooling that makes AI-powered data extraction reliable enough for production use.

Takeaways

LLM-based extraction needs preprocessing to remove HTML noise and stay within token budgets for reliable results.
Focused tools that solve specific AI integration problems are more valuable than general-purpose solutions for production teams.
AI extraction can replace brittle CSS selectors but requires thoughtful engineering to handle edge cases and failures.
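
The preprocessing step might look like the sketch below. The library itself is TypeScript; this Python version is only an illustration, and the skipped tag set, the `html_to_prompt_text` name, and the 4-characters-per-token estimate are assumptions rather than the project's actual behavior.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and ignores content inside noisy tags."""

    SKIP = {"script", "style", "noscript", "svg"}  # assumed noise sources

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_prompt_text(raw_html: str, max_tokens: int = 2000,
                        chars_per_token: int = 4) -> str:
    """Strip HTML noise, then truncate to a rough character-based token budget."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks)
    return text[: max_tokens * chars_per_token]


page = ("<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
        "<body><h1>Title</h1><p>Price: $5</p></body></html>")
print(html_to_prompt_text(page))  # Title Price: $5
```

A production version would count tokens with the target model's tokenizer instead of a character estimate, and would likely keep some structure (headings, links) instead of flattening everything to text.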

from Mar 29, 2026 · 72 points on HN · via api-hn

Intermediate

From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI

software-engineering how-we-work foundational opinion

As AI generates code faster than teams can understand it, traditional technical debt isn't the only concern — cognitive debt (team understanding erosion) and intent debt (missing rationale for decisions) become critical risks. This framework challenges teams to think beyond code quality and consider how AI affects shared understanding and knowledge capture. Essential reading for engineering leaders navigating the balance between AI velocity and long-term maintainability.

Takeaways

AI-generated code creates new forms of debt beyond traditional technical debt that can silently undermine team effectiveness.
Cognitive debt occurs when team understanding erodes faster than code accumulates, making future changes increasingly risky.
Intent debt — the absence of captured rationale — becomes critical when both humans and AI agents need to work safely with existing code.

from Mar 29, 2026 · via manual

Intermediate

Pi: The Minimal Agent Within OpenClaw

agents software-engineering how-we-work

Pi represents a minimalist approach to coding agents that focuses on doing fewer things extremely well rather than trying to be a general-purpose assistant. The author argues this constraint-driven design offers a glimpse into how production coding agents should be built — with clear boundaries and specific capabilities rather than attempting to solve every development task.

Takeaways

Minimalist agent design with clear constraints may be more effective than general-purpose coding assistants.
Focused agents that excel at specific tasks could be the future of AI-assisted development workflows.

from Mar 29, 2026 · via manual

Intermediate

Auto mode for Claude Code

agents security llms software-engineering

Anthropic introduces 'auto mode' for Claude Code that lets the AI make permission decisions autonomously, with a separate Claude model acting as a safety classifier before each action executes. This represents a sophisticated approach to the fundamental challenge of autonomous agents — how to give them freedom to act while maintaining safety guardrails through multi-model oversight.

Takeaways

Multi-model safety architectures can enable more autonomous agent behavior by having one model review another's planned actions.
Permission management in AI agents is evolving from binary allow/deny to context-aware decision making with built-in safeguards.
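
The review-before-execute pattern can be sketched as a gate around each planned action. Everything here is hypothetical: `Action`, `keyword_classifier`, and `run_with_gate` are illustrative names, and the keyword check merely stands in for the second model; it is not how Anthropic's classifier works.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    tool: str
    argument: str


def keyword_classifier(action: Action) -> bool:
    """Stand-in for the reviewing model: True means the action looks safe."""
    risky = ("rm -rf", "curl", "sudo")  # assumed markers, for illustration only
    return not any(marker in action.argument for marker in risky)


def run_with_gate(actions: List[Action],
                  classifier: Callable[[Action], bool],
                  execute: Callable[[Action], None]) -> List[str]:
    """Execute each action only after the classifier approves it."""
    log = []
    for action in actions:
        if classifier(action):
            execute(action)
            log.append(f"ran {action.tool}")
        else:
            log.append(f"blocked {action.tool}")
    return log


executed: List[Action] = []
plan = [Action("bash", "ls"), Action("bash", "rm -rf /tmp/x")]
print(run_with_gate(plan, keyword_classifier, executed.append))
# ['ran bash', 'blocked bash']
```

Passing the classifier in as a function keeps the gate independent of how the review is done, so a model call could replace the keyword check without changing the loop.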

from Mar 29, 2026 · via rss-willison

Accessible

Coding agents for data analysis

agents software-engineering how-we-work

Comprehensive workshop content demonstrating practical applications of coding agents for data analysis workflows. Covers real-world use cases like database querying, data exploration, and cleaning tasks using Claude Code and OpenAI Codex. Extremely valuable for engineers building data analysis pipelines with LLMs, providing concrete examples and methodologies rather than theoretical frameworks.

Takeaways

Coding agents excel at automating data analysis workflows including database querying, exploration, and cleaning tasks.
The workshop uses Claude Code and OpenAI Codex to build data analysis pipelines, with concrete implementation examples.
Workshop-style learning with real use cases is more valuable than theoretical frameworks for implementing coding agents.

from Mar 23, 2026 · via rss-willison

Intermediate

Agentic Harness for Real-World Compilers

Yingwei Zheng

llms agents software-engineering

Introduces the first specialized agentic framework for fixing compiler bugs, addressing the massive performance drop (60%) that frontier models experience when tackling compiler issues versus regular software bugs. The llvm-autofix system outperforms state-of-the-art by 22% and provides compiler-specific tools that general coding agents lack. Essential if you're building AI systems for low-level systems programming.

Takeaways

Frontier models experience a 60% performance drop on compiler bugs versus regular software bugs, requiring specialized tooling.
The llvm-autofix system outperforms general coding agents by 22% through compiler-specific tools and domain knowledge.
Building AI systems for specialized domains like systems programming requires domain-specific agentic frameworks.

from Mar 23, 2026 · 0 citations · via api-arxiv · arXiv:2603.20075

Accessible

Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Maximiliano Armesto

software-engineering agents how-we-work

A rare longitudinal field study tracking real software modernization projects using human-AI collaboration across three major migrations. Shows concrete metrics: portfolio delivery time dropped from 36 project-weeks to 9.3, with modeled person-day savings of 73%. This provides actual evidence for AI productivity claims in enterprise software delivery, not just individual task benchmarks.

Takeaways

Real software modernization projects using human-AI collaboration reduced portfolio delivery time from 36 project-weeks to 9.3, with modeled person-day savings of 73%.
This provides concrete evidence for AI productivity claims in enterprise software delivery beyond individual task benchmarks.
Successful human-AI collaboration in software delivery requires orchestrated workflows, not just individual AI tool adoption.

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20028

Accessible

Ask HN: AI productivity gains – do you fire devs or build better products?

Bleiglanz

how-we-work software-engineering opinion

A candid Hacker News discussion on the real productivity impacts of AI coding tools, moving beyond hype to practical experience. The author reports massive gains for boilerplate, libraries, and refactoring work while questioning long-term claims for complex enterprise systems. Valuable for understanding the actual developer experience and managing realistic expectations about AI-assisted development.

Takeaways

AI coding tools show massive productivity gains for boilerplate, libraries, and refactoring work but mixed results for complex enterprise systems.
Managing realistic expectations about AI-assisted development requires understanding the gap between hype and practical developer experience.
Teams should focus AI adoption on well-defined, repetitive coding tasks rather than complex architectural decisions.

from Mar 23, 2026 · via api-hn

Intermediate

Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Luiz C. Borro

agents llms software-engineering

Solves the expensive memory problem plaguing production LLM agents by treating memory as a data structuring challenge rather than dumping raw conversations into context. Memori converts dialogue into semantic triples and summaries, achieving 81% accuracy while using only 5% of full context tokens — resulting in 67% cost reduction over competing approaches. This is exactly what you need if you're building agents that need to remember across sessions without breaking the bank.

Takeaways

Converting dialogue to semantic triples and summaries can cut context usage to about 5% of the full-conversation tokens while maintaining 81% accuracy.
Treating agent memory as a data structuring problem rather than raw context dumping achieves 67% cost reduction over competing approaches.
Persistent memory for production agents requires semantic compression techniques to scale economically.
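
To make the "memory as structured data" idea concrete, here is a minimal sketch of a triple store. It is not Memori's implementation: the `TripleMemory` class and its methods are hypothetical, and the triples are entered by hand, whereas a real system would extract them from dialogue.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleMemory:
    """Stores facts as triples and recalls only those about one subject."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, List[Triple]] = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        triple = (subject, predicate, obj)
        if triple not in self._by_subject[subject]:  # skip exact duplicates
            self._by_subject[subject].append(triple)

    def recall(self, subject: str) -> str:
        """Return a compact string to place in the prompt instead of raw chat history."""
        facts = self._by_subject.get(subject, [])
        return "; ".join(f"{s} {p} {o}" for s, p, o in facts)


memory = TripleMemory()
memory.add("user", "prefers", "Go")
memory.add("user", "prefers", "Go")  # repeated statement, stored once
memory.add("user", "deploys to", "AWS")
print(memory.recall("user"))  # user prefers Go; user deploys to AWS
```

The recalled string is much shorter than the conversation that produced it, which is where the token savings in this kind of design would come from.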

from Mar 23, 2026 · via api-arxiv · arXiv:2603.19935

Intermediate

How we monitor internal coding agents for misalignment

security agents evaluations software-engineering

OpenAI reveals their internal methodology for monitoring coding agents for misalignment in real production deployments. This isn't theoretical safety research — it's practical guidance on detecting when your coding agents start exhibiting dangerous behaviors. Critical reading for any team deploying AI coding assistants, as it provides concrete monitoring techniques and risk detection strategies.

Takeaways

OpenAI's internal monitoring for coding agent misalignment focuses on detecting dangerous behaviors in real production deployments rather than theoretical safety.
Concrete monitoring techniques and risk detection strategies are essential for any team deploying AI coding assistants in production.
Misalignment monitoring should be built into coding agent deployment pipelines from day one.

from Mar 23, 2026 · via rss-openai