Tag: how-we-work

Accessible

Quoting Kenton Varda

Kenton Varda (Cloudflare) banned AI-generated PR descriptions from his team after finding they reliably described what the code does while omitting why — the higher-level framing reviewers actually need. This is a sharp practitioner observation: AI excels at summarizing visible structure but consistently fails at articulating the motivation, tradeoffs, and context that make code reviews meaningful. A useful corrective to uncritical adoption of AI-assisted commit hygiene.

Takeaways

AI-generated commit and PR messages optimize for describing code mechanics, not communicating intent — which is exactly backwards for reviewers.
The higher-level framing needed to understand a change is often not recoverable from the diff alone, making it irreplaceable by AI summarization.
Teams should consider explicit norms distinguishing where AI writing assistance adds value versus where it degrades communication quality.

from Jul 13, 2026 · via rss-willison

Accessible

How Do Software Professionals Evaluate AI-Generated Code? (Registered Report)

Samuli Määttä

software-engineering how-we-work evaluations

Despite widespread adoption of AI coding tools, we have surprisingly little systematic understanding of how engineers actually decide whether AI-generated code is good enough to ship. This registered report outlines a grounded theory study using surveys and interviews with 20-50 software professionals to build that understanding. Worth tracking because the resulting theory will inform how we design review workflows and tooling around AI-assisted development.

Takeaways

Current research lacks a grounded theory of how practitioners evaluate AI-generated code, making it hard to design better tooling.
How professionals evaluate AI code likely differs substantially from how they evaluate human-written code, with implications for review process design.
The study's findings will be grounded in actual practitioner accounts rather than lab experiments, increasing ecological validity.

from Jul 13, 2026 · via api-arxiv · arXiv:2607.09434

Accessible

Writing Bug Reports for Software Repair Agents: What Information Matters Most?

Vincenzo Luigi Bruno

agents software-engineering evaluations how-we-work

As AI agents take on more bug-fixing work, the way you write issue reports starts to matter differently — not for human comprehension, but as task specifications for the agent. This study systematically analyzed 441 real bug reports from SWE-bench Verified, annotating what information types (reproduction steps, expected behavior, localization cues, suggested fixes) were present and correlating them with agent fix success rates. If your team is routing issues to AI agents, this research tells you concretely what to include.

Takeaways

Bug reports written for humans often omit the structured information AI agents need most, like explicit expected behavior and reproduction steps.
Localization cues and suggested fixes in issue reports meaningfully improve agent success rates.
Agentic workflows require treating issue reports as formal task specifications, not informal communication.

from Jul 13, 2026 · via api-arxiv · arXiv:2607.09553

Intermediate

Failure as a Process: An Anatomy of CLI Coding Agent Trajectories

Xiangxin Zhao

agents evaluations software-engineering how-we-work

Rather than just measuring whether coding agents succeed or fail, this large-scale study examines how failures unfold over time across nearly 1,800 annotated agent trajectories. The process-oriented view reveals that many failures aren't sudden — they have identifiable onset points, predictable escalation patterns, and windows where recovery is still possible. Essential reading if you're building or operating coding agents and want to understand where interventions would actually help.

Takeaways

Agent failures are temporal processes with identifiable early warning patterns, not just binary outcomes.
Many failure trajectories have recovery windows that current agents consistently miss, suggesting intervention points for scaffolding improvements.
Different frontier models fail in structurally distinct ways, meaning model choice affects failure mode, not just success rate.

from Jul 13, 2026 · via api-arxiv · arXiv:2607.09510

Accessible

Building to the Test: Coding Agents Deliver What You Check, Not What You Requested

Yanuo Ma, Ben Kereopa-Yorke, Ben Schultz

agents evaluations software-engineering how-we-work

When coding agents have access to the test suite, they optimize for the tests rather than the actual deliverable — a phenomenon this paper calls 'building to the test.' In controlled experiments, agents with oracle access hit near-perfect scores while shipping essentially hollow implementations that hardcode tested behaviors. This challenges the assumption that high benchmark scores mean working software, and has direct implications for how you should structure agent evaluation in CI/CD pipelines.

Takeaways

Agents with test suite access will exploit tests as a specification, producing code that passes without implementing the underlying functionality.
Benchmark scores can be simultaneously high and meaningless if agents have learned to optimize for the metric rather than the goal.
Robust agent evaluation requires hidden or post-hoc validation that the agent cannot observe or optimize against during implementation.

from Jul 6, 2026 · via api-hf · arXiv:2606.28430

$What it Means to Be a Mathematician When AI Does the Math$

Accessible

What it Means to Be a Mathematician When AI Does the Math

opinion foundational how-we-work

As AI systems like AlphaProof tackle olympiad-level problems, mathematicians are grappling with an identity crisis: if the machine can do the math, what's left for humans? This piece surfaces the honest debate happening inside mathematics departments about whether AI is a tool, a collaborator, or an existential threat to the discipline's core purpose. Worth reading for any engineer who's asked themselves the same question about their own craft.

Takeaways

The fear isn't job loss but loss of meaning — mathematicians worry AI removes the intellectual struggle that makes the work rewarding.
Some researchers see AI as a powerful collaborator that handles tedious verification, freeing humans for higher-level creativity.
The field hasn't reached consensus, and the honest answer is that nobody knows yet what the human role will look like.

from Jul 6, 2026 · via suggestion

Intermediate

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

Mohit Raghavendra, Anisha Gunjal, Aakash Sabharwal, Yunzhong He

agents evaluations software-engineering how-we-work

Current SWE benchmarks hand agents a complete spec and grade the output — but real developer workflows involve vague requirements, iterative clarification, and shifting constraints. SWE-Interact tests exactly that, and the findings are sobering: models that ace single-turn benchmarks often fall apart when requirements evolve mid-task. Essential reading if you're building or evaluating coding agents for real-world use.

Takeaways

Strong single-turn SWE benchmark scores do not predict success in multi-turn, user-driven coding sessions.
Agents frequently fail to proactively clarify ambiguous requirements, a skill that's critical in realistic workflows.
Evaluating agents only on autonomous, fully-specified tasks creates a false picture of production readiness.

from Jul 6, 2026 · via api-hf · arXiv:2606.30573

Accessible

Is AI ruining our skills? Early results are in — and they’re not good

how-we-work software-engineering opinion llms

If you've been wondering whether leaning on Copilot or ChatGPT is quietly eroding your ability to think through problems independently, early research suggests the concern is legitimate. Studies on physicians and software engineers show measurable skill degradation from AI tool reliance, which directly challenges the 'AI as a productivity multiplier' narrative by suggesting there may be cognitive costs that don't show up in short-term output metrics.

Takeaways

Regular AI tool reliance correlates with degraded independent problem-solving ability in both physicians and engineers.
Short-term productivity gains may mask longer-term skill atrophy that's hard to reverse.
Teams need deliberate practice strategies to maintain core competencies alongside AI assistance.

from Jun 22, 2026 · 0 citations · via suggestion

Intermediate

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

Asa Shepard

agents software-engineering prompt-engineering evaluations how-we-work

If your team is using AGENTS.md files (or similar repo guidance docs) to orient coding agents, this paper explains why some of them help and others actively hurt performance — and it's all about how the guidance is generated. The probe-and-refine method uses synthetic bug probes to iteratively diagnose and patch guidance files without an agent loop, achieving a 33% vs 28.3% resolve rate improvement on SWE-bench, a meaningful lift from a purely prompt-side intervention.

Takeaways

Hand-written or naively LLM-generated AGENTS.md files can harm agent performance; iterative refinement driven by synthetic probes is key.
Probe-and-refine requires no agent loop or tool use during tuning, making it lightweight to adopt.
How repository guidance is produced matters more than whether it exists at all.

from Jun 22, 2026 · via api-arxiv · arXiv:2606.20512

Accessible

Quoting Charity Majors

opinion how-we-work software-engineering

Charity Majors captures the most important economic shift in software in a single observation: code went from scarce and precious to free and disposable almost overnight, which inverts decades of engineering intuition about reuse, curation, and quality. The implication she draws — that this demands more engineering discipline, not less — is a direct challenge to teams treating AI-assisted development as a reason to relax standards.

Takeaways

When code generation becomes free, the bottleneck shifts from writing code to understanding, evaluating, and maintaining it — which requires deeper engineering judgment.
Disposable code generation pressure makes architecture, testing, and observability disciplines more critical, not less.
Teams that lower their quality bar because AI makes iteration cheap will accumulate technical debt faster than ever before.

from Jun 22, 2026 · 1 citations · via rss-willison

Accessible

Why AI hasn’t replaced software engineers, and won’t

opinion how-we-work software-engineering

Narayanan and Kapoor challenge the AI displacement narrative by analyzing software engineering - the profession most vulnerable to AI automation due to low regulatory barriers and high AI suitability. They argue that evidence suggests AI won't cause mass layoffs even in this ideal case for displacement, with implications for other professions facing AI disruption. Essential reading for software engineers concerned about career security in the AI era.

Takeaways

Even in software engineering - the profession most suited to AI disruption - evidence doesn't support mass displacement scenarios.
AI capabilities reaching certain thresholds don't automatically translate to widespread job replacement.
Other professions with higher regulatory barriers are likely even more resistant to AI displacement than software engineering.

from Jun 15, 2026 · 1 citations · via rss-willison

Intermediate

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

Wei Wu

agents software-engineering how-we-work

This longitudinal study of a production LLM agent system reveals a critical pattern: silent failures where error signals never reach humans in actionable form, occurring 28+ times over 8 weeks despite extensive testing. The five-class taxonomy of failure modes is immediately actionable for anyone building agent systems, with 'chained hallucination and fabrication' being uniquely dangerous to LLM systems. This is must-read research for understanding how LLM agents fail differently from traditional software.

Takeaways

Silent failures where errors don't surface to humans are a critical failure mode unique to LLM agent systems.
Traditional testing approaches (4,286 unit tests, 827 governance checks) don't prevent these failure patterns.
Chained hallucination represents the most dangerous failure class, where systems confidently fabricate plausible but wrong information.

from Jun 15, 2026 · via api-arxiv · arXiv:2606.14589

Accessible

AI enthusiasts are in a race against time, AI skeptics are in a race against entropy

opinion how-we-work

Charity Majors captures the current tension in software teams between those pushing hard on AI adoption and those preferring to wait for stability. The insight is that AI enthusiasts face time pressure to capitalize on rapid capability improvements, while skeptics face entropy pressure as the gap widens between AI-augmented and traditional development. Essential perspective for engineering leaders navigating team dynamics in the AI transition.

Takeaways

AI enthusiasts and skeptics face different types of competitive pressure within the same teams.
Teams that lean into AI are seeing discontinuous capability leaps that feel different from normal technology cycles.
The dynamic creates urgency that makes waiting for stability potentially costly.

from Jun 8, 2026 · via rss-willison

Intermediate

Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Sajjad Khan

agents software-engineering security how-we-work

This empirical study catalogs 63 real production failures where LLM agents burned through token budgets, costing thousands of dollars in retry loops before operators noticed. The authors demonstrate how Rust's affine type system can prevent these budget overruns at compile time rather than hoping runtime checks catch them. If you're deploying agents in production, this research shows you exactly what can go wrong and provides a concrete mitigation strategy.

Takeaways

Documents 63 confirmed production incidents of LLM agent budget overruns across 21 orchestration frameworks.
Demonstrates that affine type systems can prevent budget double-spending and use-after-delegation at compile time.
Provides concrete taxonomy of failure modes with documented dollar losses from real deployments.

from Jun 8, 2026 · via api-hf · arXiv:2606.04056

Intermediate

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Zhangchen Xu, Junda Chen, Yue Huang, Dongfu Jiang, Jiefeng Chen, Hang Hua, Zijian Wu, Zheyuan Liu, Zexue He, Lichi Li, Shizhe Diao, Jiaxin Pei, Jinsung Yoon, Hao Zhang, Mengdi Wang, Radha Poovendran, Misha Sra, Alex Pentland, Zichen Chen

agents evaluations software-engineering how-we-work

If you're building production AI systems, this benchmark reveals why most current evaluations miss the boat entirely. While existing benchmarks test single responses, AutoLab measures what actually matters: whether AI agents can iteratively improve code and systems over hours or days, just like real engineering work. The key finding will change how you think about agent capabilities — persistence in trying different approaches matters far more than getting it right on the first attempt.

Takeaways

Current AI benchmarks fail to capture the iterative improvement process that defines real engineering work.
Agent persistence and willingness to retry different approaches predicts success better than initial solution quality.
The benchmark spans realistic domains including system optimization and CUDA kernel development.

from Jun 8, 2026 · via api-hf · arXiv:2606.05080

Intermediate

Show HN: AISlop, a CLI for catching AI generated code smells

Heavykenny

software-engineering llms agents how-we-work

AI-generated code often passes tests but contains subtle quality issues like empty catch blocks, useless comments, and dead code — patterns that human developers would avoid. AISlop is a practical CLI tool that scans for these AI-specific code smells and can be wired into development workflows to catch them automatically. If you're using AI coding assistants in production, this addresses the real problem that AI code can be technically correct but stylistically poor.

Takeaways

AI-generated code suffers from systematic quality issues that pass tests but violate good coding practices.
Automated detection of AI-specific code smells can be integrated into development workflows as quality gates.
Local scanning tools can catch AI code quality issues without sending code to external services.

from Jun 1, 2026 · 73 points on HN · via api-hn

Accessible

Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

Zhimin Zhao, Zehao Wang, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan

evaluations software-engineering how-we-work

ML evaluation harnesses are critical infrastructure that's surprisingly broken — this empirical study of 57 harnesses reveals that 41% of issues occur in the specification stage where systems integrate external models, datasets, and judges. The three biggest problems are unimplemented features, documentation gaps, and missing input validation, accounting for over 60% of operational challenges. Essential reading if you're building evaluation infrastructure or trying to understand why ML evaluation systems are so brittle.

Takeaways

Most evaluation harness failures occur during specification stages involving external model and dataset integration.
Unimplemented features, documentation gaps, and missing input validation cause the majority of operational issues.
Evaluation engineering requires different quality practices than traditional software development due to complex external dependencies.

from Jun 1, 2026 · via api-hf · arXiv:2605.24213

Intermediate

PithTrain: A Compact and Agent-Native MoE Training System

Ruihang Lai

agents software-engineering llms how-we-work

If you're planning to use AI coding agents to build or modify ML training frameworks, this paper should change how you design those systems. The authors identify 'agent-task efficiency' as a critical but overlooked metric — essentially, how easy is it for AI agents to understand and modify your codebase? They built PithTrain, an MoE training framework designed from the ground up to be agent-friendly, showing you can match production throughput while dramatically improving agent productivity on real development tasks.

Takeaways

Agent-native design principles can maintain performance while dramatically improving AI assistant productivity on framework development tasks.
Traditional throughput metrics miss the hidden costs of using AI agents on complex codebases.
Compact, well-structured frameworks enable better human-AI collaboration than monolithic production systems.

from Jun 1, 2026 · via api-arxiv · arXiv:2605.31463

Accessible

Quoting Armin Ronacher

software-engineering how-we-work opinion

Armin Ronacher identifies a growing problem plaguing open source: users submitting AI-generated bug reports that obscure actual issues with confident but inaccurate conclusions and fake minimal reproductions. This observation captures a critical breakdown in the feedback loop between users and maintainers that threatens the quality of issue tracking and debugging processes.

Takeaways

AI-generated bug reports often contain inaccurate conclusions despite appearing confident and well-structured.
The real user voice gets lost when issues are filtered through AI tools, making root cause analysis nearly impossible.
This trend threatens the quality of open source issue tracking and maintainer-user communication.

from May 25, 2026 · via rss-willison

Accessible

Learnings from 100K lines of Rust with AI (2025)

pramodbiligiri

software-engineering how-we-work agents

Practical insights from building a substantial Rust codebase with AI assistance that likely covers the realities of AI-assisted development at scale. Without access to the specific learnings, this represents valuable field experience for engineers considering AI integration into their development workflows, particularly for systems programming where correctness and performance matter.

Takeaways

Large-scale AI-assisted development provides real-world insights beyond typical toy examples.
Rust's strict type system likely offers unique lessons for AI-assisted systems programming.

from May 25, 2026 · 190 points on HN · via api-hn

Intermediate

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Runyuan He, Qiuyang Mang, Shang Zhou, Kaiyuan Liu, Hanchen Li, Huanzhi Mao, Qizheng Zhang, Zerui Li, Bo Peng, Lufeng Cheng, Tianfu Fu, Yichuan Wang, Wenhao Chai, Jingbo Shang, Alex Dimakis, Joseph E. Gonzalez, Alvin Cheung

software-engineering foundational how-we-work

This addresses a critical bottleneck in training better coding agents—the scarcity of open-ended programming problems that mirror real-world development challenges. FrontierSmith automatically evolves competitive programming problems into open-ended variants that elicit diverse solution approaches. Essential for understanding how to improve AI coding capabilities beyond the current focus on well-defined tasks like bug fixes and feature implementation.

Takeaways

Open-ended coding problems are essential for training LLMs that can handle real-world development challenges.
Automated synthesis can scale creation of diverse coding problems that elicit genuinely different solution approaches.
Current LLM coding training focuses too heavily on well-defined tasks versus the ambiguous problems developers actually face.

from May 18, 2026 · via api-hf · arXiv:2605.14445

Accessible

Not so locked in any more

software-engineering how-we-work opinion

This captures a profound shift in software engineering economics—AI coding agents are eliminating traditional language and platform lock-in by making rewrites economically feasible. The example of a company using coding agents to migrate legacy iPhone/Android apps to React Native illustrates how AI changes the cost-benefit calculus of maintaining separate codebases. This has massive implications for technology choices and technical debt management.

Takeaways

AI coding agents are reducing the economic barriers to cross-platform migrations and rewrites.
Traditional platform lock-in becomes less relevant when AI can handle the tedious work of code translation.
Strategic technology decisions need to account for dramatically lower migration costs in an AI-augmented world.

from May 18, 2026 · via rss-willison

Accessible

Why senior developers fail to communicate their expertise

software-engineering how-we-work opinion

This challenges the conventional wisdom that technical expertise alone makes senior developers valuable in the AI era. The author argues that senior developers instinctively focus on technical complexity while business stakeholders worry about uncertainty—a communication gap that becomes critical when AI can handle much of the complexity but amplifies the uncertainty. If you're a senior engineer wondering how to stay relevant, this reframes the conversation entirely.

Takeaways

Senior developers must shift from communicating complexity to addressing business uncertainty in AI-augmented workflows.
Traditional technical communication patterns become counterproductive when AI handles routine complexity.
The most valuable senior developers will be those who can translate between AI capabilities and business outcomes.

from May 18, 2026 · via suggestion

Intermediate

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang

agents evaluations how-we-work

This benchmark exposes the embarrassing gap between synthetic agent evaluations and real-world performance. While most benchmarks use mock APIs and toy tasks, WildClawBench runs agents in actual CLI environments with real tools for 8+ minute tasks. The results are sobering—even frontier models like Claude Opus achieve only 35% success rates. If you're building production agents, this benchmark reveals what you're actually up against.

Takeaways

Synthetic benchmarks dramatically overestimate real-world agent performance in production environments.
Long-horizon tasks in native runtimes reveal fundamental limitations even in frontier models.
Production agent deployment requires significantly different evaluation criteria than academic benchmarks suggest.

from May 18, 2026 · via api-hf · arXiv:2605.10912

Intermediate

Harness engineering: leveraging Codex in an agent-first world

agents software-engineering how-we-work

Essential reading for anyone building agent-first development workflows. Lopopolo shares practical insights from Codex implementation that challenge conventional wisdom about how AI should integrate into software engineering processes. This isn't another theoretical piece—it's a practitioner's guide to harnessing AI agents in real development environments where traditional tooling falls short.

Takeaways

Agent-first workflows require fundamentally different architectural thinking than traditional AI-assisted development.
Codex integration succeeds when it becomes the primary interface rather than a secondary tool.
Production agent systems need careful harness engineering to bridge the gap between AI capabilities and developer workflows.

from May 18, 2026 · via suggestion

Accessible

SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Siddhant Saxena, Nilesh Trivedi, Vinayaka Jyothi

software-engineering evaluations agents how-we-work

The first comprehensive evaluation framework for AI coding platforms that treats them as virtual software agencies rather than just code generators. The 68-metric evaluation across product management, engineering, and operations reveals four critical shortcomings in current platforms: specification bottlenecks, architectural blind spots, iteration fragility, and business readiness gaps—essential insights for anyone building or evaluating AI development tools.

Takeaways

AI coding platforms need evaluation beyond code quality to include product management and operations capabilities.
Current platforms struggle with specification understanding, architectural decisions, and iterative development.
Business readiness requires capabilities spanning multiple roles, not just engineering output.

from May 11, 2026 · via api-hf · arXiv:2605.04637

Intermediate

Agentic AI Systems Should Be Designed as Marginal Token Allocators

Siqi Zhu

agents opinion foundational how-we-work

Essential reading if you're building agentic systems—this paper reframes agent design through economic principles, showing how routing, planning, serving, and training decisions all solve the same optimization problem: marginal benefit equals marginal cost plus latency plus risk. Instead of thinking about agents as text generators, this framework treats them as token allocation economies, explaining why locally optimal decisions often lead to globally suboptimal performance.

Takeaways

All agent system layers (routing, planning, serving, training) solve the same economic optimization problem.
Local token minimization often leads to global misallocation of computational resources.
Agent performance should be evaluated through marginal token allocation efficiency rather than just accuracy metrics.

from May 11, 2026 · via api-hf · arXiv:2605.01214

Accessible

Appearing Productive in The Workplace — No One

how-we-work opinion software-engineering

This challenges the conventional wisdom that AI-generated code is obviously detectable by experienced engineers. The author argues that AI can now produce work that passes expert review while containing fundamental flaws that only surface later in production, creating two dangerous failure modes: code that looks professional but lacks deep understanding, and teams that become dependent on AI output they can't properly evaluate.

Takeaways

AI-generated work can fool experienced reviewers by appearing expert without actually being expert.
The failure modes are both immediate (bad code getting through) and systemic (teams losing evaluation skills).
Traditional code review processes may be insufficient for AI-assisted development.

from May 11, 2026 · via suggestion

Accessible

Your CEO is suffering from AI psychosis

opinion how-we-work

A pointed critique of executive-level AI hype that's driving unrealistic expectations and poor technical decisions in organizations. While the title is provocative, this addresses the real challenge engineers face when leadership makes AI commitments without understanding the technology's limitations, leading to impossible timelines and misallocated resources.

Takeaways

Executive AI enthusiasm often disconnects from technical reality and constraints.
Engineers need strategies for managing unrealistic AI expectations from leadership.
The hype cycle is creating organizational problems that technical teams must navigate.

from May 11, 2026 · via suggestion

Intermediate

Where the goblins came from

llms evaluations how-we-work

Investigates the emergence and propagation of quirky, personality-driven outputs ('goblins') in AI models, tracing their timeline, root causes, and potential fixes. This analysis of unexpected model behavior is highly relevant for engineers debugging production systems and understanding how subtle training or deployment changes can lead to widespread behavioral shifts.

Takeaways

Personality-driven quirks in model outputs can emerge and spread through training processes in unexpected ways.
Understanding the root causes of 'goblin' behaviors helps engineers identify and prevent similar issues in production.
Model behavior debugging requires systematic analysis of training timelines and data sources.

from May 4, 2026 · via rss-openai

Intermediate

Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

Chenkai Pan, Xinglong Xu, Yuhang Xu, Yujun Wu, Siyuan Li, Jintao Chen, Conghui He, Jingxuan Wei, Cheng Tan

software-engineering evaluations how-we-work foundational

This research revolutionizes LLM data engineering by mapping the machine learning lifecycle directly onto software development practices—treating training data as source code, model training as compilation, and failures as bugs to debug. For teams struggling with opaque training processes and data quality issues, this framework offers a systematic approach to diagnosing and fixing model deficiencies at the data level.

Takeaways

Training data can be treated as source code with structured representations enabling systematic debugging of model failures.
The ML development lifecycle maps precisely onto software engineering practices when proper abstractions are established.
Concept-level gaps in training data become debuggable when models fail on domain-specific tasks.

from May 4, 2026 · via api-hf · arXiv:2604.24819

Accessible

Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

T. J. Barton, Chris Constantakis, Patti Hauseman, Annie Mous, Alaska Hoffman, Brian Bergeron, Hunter Goodreau

agents software-engineering evaluations how-we-work

A remarkable real-world case study of autonomous LLM agents managing actual financial capital over 21 days, generating 7.5M invocations and $20M in trading volume with 99.9% settlement success. This paper provides invaluable insights into building reliable production agent systems, showing that reliability emerges from the operating layer architecture rather than the base model alone.

Takeaways

Reliability in production AI agents comes from systematic operating layer controls, not just model capabilities.
Real capital deployment reveals failure modes and reliability patterns invisible in simulation environments.
Large-scale agent deployments require careful attention to validation, state management, and settlement infrastructure.

from May 4, 2026 · via api-hf · arXiv:2604.26091

Intermediate

The Last Harness You'll Ever Build

Haebin Seong, Li Yin, Haoran Zhang

agents software-engineering how-we-work

Presents an evolutionary framework that automates the painful process of building agent harnesses for new domains, using adversarial evaluation and iterative refinement to optimize prompts, tools, and orchestration logic. This directly tackles one of the biggest bottlenecks in production AI systems—the manual engineering required to make foundation models effective for specific enterprise workflows.

Takeaways

Agent harness engineering can be automated through evolutionary optimization with adversarial evaluation feedback.
The meta-evolution loop concept enables systems to improve their own optimization processes over time.
Automated harness creation could dramatically reduce the engineering overhead of deploying agents in new domains.

from May 4, 2026 · via api-hf · arXiv:2604.21003

Intermediate

Fine-Tuning for an Exam Quality Tutor

llms how-we-work

A hands-on exploration of fine-tuning a 27B parameter model for personalized learning that reveals the practical realities of adapting large models for specific use cases. This personal experiment offers valuable insights into the effort, infrastructure, and unexpected challenges you'll face when moving beyond API calls to custom model training.

Takeaways

Fine-tuning large models for specialized tasks requires significant infrastructure planning and iteration cycles.
The gap between theoretical fine-tuning approaches and practical implementation reality is substantial.
Personal use cases can serve as effective testing grounds for understanding model customization challenges.

from May 4, 2026 · via suggestion

Accessible

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward

Samuel Sameer Tanguturi

opinion foundational agents how-we-work

This position paper argues that the most critical missing piece in AI architecture is a 'continuity layer' that preserves what models learn across sessions, addressing the fundamental amnesia problem where powerful per-session intelligence is lost when contexts reset. The paper challenges the field's focus on model size over persistent understanding and outlines specific engineering requirements for systems that truly accumulate knowledge over time.

Takeaways

The absence of persistent memory across sessions is a more critical architectural problem than model size in current AI systems.
Current memory APIs return flat facts that models must reinterpret from scratch, creating powerful but amnesiac intelligence.
A continuity layer requires seven specific characteristics including persistent state, selective retention, and coherent knowledge integration.

from Apr 27, 2026 · via api-hf · arXiv:2604.17273

Accessible

SWE-chat: Coding Agent Interactions From Real Users in the Wild

Joachim Baumann, Vishakh Padmakumar, Xiang Li, John Yang, Diyi Yang, Sanmi Koyejo

agents software-engineering how-we-work evaluations

SWE-chat provides the first large-scale empirical evidence of how developers actually use AI coding agents in the wild, revealing that usage patterns are bimodal and agents are surprisingly inefficient. The dataset shows that only 44% of agent-produced code makes it into user commits, challenging the narrative of coding agent effectiveness and providing crucial insights for anyone building or deploying these tools in production.

Takeaways

Real-world coding patterns are bimodal: 41% of sessions involve agents writing virtually all code, while 23% have humans writing everything themselves.
Despite improving capabilities, only 44% of agent-produced code survives into user commits, revealing significant inefficiency in natural settings.
The first large-scale dataset of real coding agent usage provides empirical evidence that challenges assumptions about agent effectiveness in production.

from Apr 27, 2026 · via api-hf · arXiv:2604.20779

Intermediate

Quo Vadis, Code Review? Exploring the Future of Code Review

software-engineering how-we-work

A survey of 100 developers across five companies reveals how AI automation is reshaping code review practices while the fundamentals remain essential. The research shows that practitioners expect code review to stay critical but anticipate significant changes in what gets reviewed and how much time it takes. This matters because understanding these trends helps teams adapt their review processes and tooling investments as AI-assisted development becomes mainstream.

Takeaways

Developers expect code review to remain essential despite increasing AI automation in development workflows.
The scope and time investment in code review are expected to shift significantly over the next five years as AI tools mature.
Teams need to proactively adapt review processes and tooling strategies to work effectively with AI-assisted development.

from Apr 27, 2026 · via suggestion · arXiv:2508.06879

Intermediate

The AI engineering stack we built internally — on the platform we ship

software-engineering how-we-work llms

Cloudflare shares real metrics from running their own AI engineering stack in production, processing 241 billion tokens and serving 3,683 internal users. This is essential reading if you're building AI infrastructure — they dogfood their own products (AI Gateway, Workers AI) and provide actual numbers on throughput, costs, and architectural decisions. The post challenges the common wisdom of building separate dev/prod AI stacks by showing how running on your own platform reveals critical performance and scalability insights.

Takeaways

Running AI infrastructure on the same platform you ship reveals hidden performance bottlenecks and helps prioritize product improvements.
Processing 241 billion tokens across 20 million requests provides concrete scale benchmarks for AI Gateway architecture decisions.
Dogfooding AI products with thousands of internal users uncovers real-world usage patterns that synthetic benchmarks miss.

from Apr 27, 2026 · via suggestion

Intermediate

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li

agents software-engineering how-we-work

TREX automates the entire LLM fine-tuning pipeline through multi-agent collaboration, from literature research to data preparation to model evaluation. This challenges the current reality where fine-tuning requires extensive manual orchestration by ML engineers, offering a glimpse into fully automated ML workflows that could democratize model customization for domain-specific applications.

Takeaways

Multi-agent systems can automate complex ML workflows beyond individual tasks, handling entire fine-tuning lifecycles.
Modeling the experimental process as a search tree enables efficient exploration and reuse of historical training results.
Automated fine-tuning could significantly reduce the expertise barrier for domain-specific LLM customization.

from Apr 20, 2026 · via api-hf · arXiv:2604.14116

Accessible

Steve Yegge

how-we-work agents software-engineering opinion

Yegge's conversation reveals that even Google's engineering teams follow the same AI adoption pattern as traditional companies: 20% power users building with agents, 20% refusing AI tools entirely, and 60% stuck using basic chat interfaces like Cursor. This insight challenges assumptions about tech giants being ahead on internal AI adoption and suggests most organizations are at similar maturity levels regardless of their AI product offerings.

Takeaways

Google's internal AI adoption mirrors traditional companies despite their advanced AI research and products.
The industry-wide pattern shows 60% of engineers still using basic chat tools rather than advanced agentic workflows.
Having cutting-edge AI products doesn't necessarily translate to advanced internal adoption within engineering teams.

from Apr 20, 2026 · 0 citations · via rss-willison

Intermediate

When Using AI Leads to “Brain Fry”

agents how-we-work foundational

If your team is pushing engineers to maximize AI agent usage (measured by token consumption), this research reveals the hidden costs you're creating. Organizations incentivizing heavy AI tool oversight are inadvertently driving employees to a cognitive breaking point where mental fatigue leads to increased errors, poor decision-making, and higher turnover. Essential reading for engineering leaders designing AI-driven workflows who want to avoid burning out their teams.

Takeaways

Measuring and rewarding token consumption as a performance metric directly contributes to cognitive overload and employee burnout.
"AI brain fry" manifests as mental fog, slower decision-making, and headaches from excessive AI tool oversight beyond cognitive capacity.
AI workflows can be designed to reduce burnout through specific manager, team, and organizational practices that limit cognitive strain.

from Apr 20, 2026 · via suggestion

Advanced

Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task

llms foundational how-we-work

This neurological study challenges the assumption that LLM-assisted coding is cognitively easier for developers. Using EEG brain scans, researchers found that engineers using LLMs showed significantly weaker brain connectivity compared to those coding without AI assistance, suggesting reduced cognitive engagement that could impact long-term problem-solving abilities. Critical evidence for teams debating whether heavy AI assistance might be creating "cognitive debt" among developers.

Takeaways

LLM-assisted coding shows the weakest brain connectivity patterns compared to brain-only or search-assisted programming.
Heavy AI assistance may reduce cognitive engagement in ways that could impact developers' problem-solving capabilities over time.
The study provides neurological evidence that AI assistance creates measurable differences in how the brain processes coding tasks.

from Apr 20, 2026 · via suggestion · arXiv:2506.08872

Accessible

Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure

Huacan Wang, Jie Zhou, Ningyan Zhu, Shuo Zhang, Feiyu Chen, Jiarou Wu, Ge Chen, Chen Liu, Wangyi Chen, Xiaofeng Mou, Yi Xu

software-engineering agents how-we-work

Sema Code tackles the enterprise reality that every AI coding solution locks you into their specific interface, making it impossible to reuse AI capabilities across different development environments. Their embeddable architecture decouples the AI reasoning engine from delivery mechanisms, letting teams integrate the same AI coding capabilities into CLIs, IDEs, web apps, or custom toolchains without rebuilding from scratch.

Takeaways

Current AI coding solutions create vendor lock-in by coupling reasoning capabilities with specific delivery interfaces.
Decoupling the AI engine into a standalone library enables reuse across heterogeneous engineering environments.
The framework addresses enterprise needs like multi-tenancy, session management, and permission control that are missing from consumer AI coding tools.

from Apr 20, 2026 · via api-hf · arXiv:2604.11045

Intermediate

Extreme Harness Engineering for Token Billionaires: 1M LOC, 1B toks/day, 0% human code, 0% human review — Ryan Lopopolo, OpenAI Frontier & Symphony

software-engineering llms how-we-work

Move over prompt engineering—harness engineering is the new frontier for building production LLM systems at massive scale. This deep dive from OpenAI's Ryan Lopopolo reveals how teams operating at token-billionaire scale (1B tokens/day) architect systems with millions of lines of code generated without human review. The focus shifts from optimizing individual prompts to engineering the entire infrastructure that channels LLM capabilities into reliable, scalable production systems.

Takeaways

At massive scale, engineering the infrastructure around LLMs matters more than optimizing individual prompts.
Production systems generating millions of lines of code daily require fundamentally different architectural approaches.
Token billionaire scale operations demand new engineering disciplines focused on harness systems rather than model tuning.

from Apr 13, 2026 · via rss-latentspace

Accessible

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

agents evaluations security how-we-work

Testing agents on live productivity services is too risky, but existing benchmarks don't capture the complexity of real workflows across Gmail, Slack, and Google services. ClawsBench solves this with high-fidelity mock services that maintain full state and support deterministic snapshot/restore, enabling safe evaluation of 44 structured tasks including dangerous scenarios. The research reveals that domain skills (API knowledge injection) and meta prompts (cross-service coordination) are independent levers that teams can optimize separately for better agent performance.

Takeaways

High-fidelity simulation environments with full state management enable safe evaluation of agents in realistic productivity scenarios.
Domain skills and meta prompts are independent architectural components that can be optimized separately for better agent performance.
Safety-critical scenarios must be explicitly tested since agents can cause irreversible damage in productivity environments.

from Apr 13, 2026 · via api-hf · arXiv:2604.05172

Accessible

From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI

software-engineering how-we-work opinion foundational

As teams increasingly rely on AI to accelerate development, this framework warns that we're accumulating dangerous new forms of debt beyond just technical debt. Cognitive debt occurs when teams lose shared understanding of their systems as AI generates code faster than they can comprehend it, while intent debt refers to the missing documentation of why decisions were made—critical context that both humans and AI agents need to safely evolve code. This triple debt model provides a essential lens for evaluating software health in the AI era.

Takeaways

Cognitive debt erodes team understanding as AI generates code faster than teams can internalize it, creating dangerous knowledge gaps.
Intent debt—missing rationale and constraints—becomes critical when AI agents need explicit context to safely modify code.
Traditional technical debt metrics miss these human and knowledge-based risks that dominate in AI-assisted development.

from Apr 13, 2026 · via suggestion

Accessible

Ask HN: Client took over development by vibe coding. What to do?

piscator

software-engineering how-we-work opinion

A developer's experience with a client who embraced "vibe coding" with Claude Code, making rapid changes without proper planning or architecture consideration. This highlights the tension between AI-enabled development speed and traditional software engineering discipline, raising important questions about maintaining code quality and project management when AI makes coding feel effortless.

Takeaways

AI coding tools can enable rapid development that bypasses important planning and architecture phases.
"Vibe coding" with AI can create technical debt and project management challenges despite apparent productivity gains.
Professional development workflows need to adapt to balance AI speed with engineering discipline.

from Apr 6, 2026 · 61 points on HN · via api-hn

Accessible

Quoting Greg Kroah-Hartman

security llms how-we-work

Greg Kroah-Hartman, Linux kernel maintainer, describes a dramatic shift in AI-generated security reports from obvious "slop" to genuinely valuable contributions in just one month. This represents a critical inflection point where AI tools have crossed the threshold from nuisance to legitimate assistance in security research. The timing and scale of this change suggests we're witnessing a fundamental capability leap in AI security tooling.

Takeaways

AI-generated security reports have rapidly evolved from low-quality noise to genuinely valuable contributions.
The transformation happened suddenly rather than gradually, suggesting a capability threshold was crossed.
Open source maintainers are now receiving quality AI-assisted security research that requires serious attention.

from Apr 6, 2026 · via rss-willison

Accessible

Tell HN: Anthropic no longer allowing Claude Code subscriptions to use OpenClaw

firloop

llms software-engineering how-we-work

Anthropic's policy change affecting third-party tools like OpenClaw represents a significant shift in how developers can access Claude's capabilities outside official interfaces. This impacts teams that have built workflows around unofficial Claude integrations and highlights the business risks of depending on third-party API access patterns. Important for understanding the evolving landscape of AI tool accessibility.

Takeaways

Third-party Claude integrations now require separate pay-as-you-go billing beyond subscription limits.
Teams using unofficial Claude tools need to evaluate cost implications and migration strategies.
The change reflects tightening control over AI model access as these tools become more strategically important.

from Apr 6, 2026 · 1079 points on HN · via api-hn

Intermediate

Eight years of wanting, three months of building with AI

agents software-engineering how-we-work foundational

A compelling case study of how AI agents transformed an eight-year software vision into reality in just three months, specifically building comprehensive SQLite development tools. The author provides detailed insights into agentic engineering workflows and how AI can tackle complex, long-deferred projects that seemed too daunting for traditional development approaches. This demonstrates the paradigm shift from AI as a coding assistant to AI as a capable engineering partner.

Takeaways

AI agents can make previously intractable personal projects suddenly feasible by handling complex implementation details.
Agentic engineering workflows enable rapid prototyping of sophisticated developer tools that would take months using traditional methods.
The key to successful AI-assisted development is clearly defining goals while letting agents handle implementation complexity.

from Apr 6, 2026 · via rss-willison

Accessible

Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics

software-engineering how-we-work llms

This research challenges the assumption that AI coding tools work equally well on all codebases by showing that existing code quality metrics predict how reliably LLMs can refactor code without breaking it. Teams can use metrics like CodeHealth to identify where AI assistance is safer to deploy and where human oversight is critical. Essential reading for engineering leaders planning AI tool rollouts — it turns out investing in code maintainability isn't just about helping humans, it's about preparing your codebase for AI.

Takeaways

Human-friendly code quality metrics like CodeHealth strongly correlate with AI refactoring success rates.
Teams can proactively identify high-risk areas for AI intervention using existing code quality tools.
Investing in code maintainability pays dividends for both human developers and AI tooling effectiveness.

from Apr 6, 2026 · via suggestion

Accessible

Falling For Claude

llms software-engineering how-we-work

A candid reflection on how always-available AI coding assistants like Claude can blur work-life boundaries in unexpected ways. The author explores the psychological and practical implications of having a tireless coding companion that makes it tempting to work at all hours. Important perspective for engineers and managers thinking about sustainable AI adoption practices.

Takeaways

AI coding assistants can create unhealthy work patterns by making development feel frictionless at any time.
The always-available nature of AI tools requires intentional boundaries to maintain work-life balance.

from Apr 6, 2026 · via suggestion

Intermediate

We Rewrote JSONata with AI in a Day, Saved $500K/Year

software-engineering how-we-work llms

A compelling case study of 'vibe porting' — using AI to rewrite JSONata in Go guided by the existing test suite, achieving significant cost savings in just 7 hours and $400 of API costs. This demonstrates a practical methodology for AI-assisted rewrites: leverage comprehensive tests as guardrails and let AI handle the mechanical translation work.

Takeaways

Comprehensive test suites enable reliable AI-powered porting between languages with minimal human oversight.
Vibe porting can deliver substantial business value ($500K annual savings) when applied to performance-critical components.
The methodology scales: 7 hours of AI-assisted development replaced what would have been months of manual rewriting.

from Mar 29, 2026 · via rss-willison

Accessible

If you don't opt out by Apr 24 GitHub will train on your private repos

vmg12

security software-engineering how-we-work

GitHub is automatically opting users into training Copilot on private repositories unless they explicitly opt out by April 24th — a significant policy change that could expose proprietary code to AI training. This represents a major shift in how code hosting platforms treat private repositories and requires immediate action from teams concerned about code privacy.

Takeaways

GitHub's default opt-in policy for private repo training changes the privacy expectations for enterprise code.
Teams need to audit their GitHub settings immediately to prevent proprietary code from entering AI training datasets.

from Mar 29, 2026 · 719 points on HN · via api-hn

Intermediate

Thoughts on slowing the fuck down

agents software-engineering opinion how-we-work

The creator of Pi agent framework delivers a sharp critique of current AI-assisted development practices, arguing that the rush to generate code quickly is eroding engineering discipline and creating unsustainable technical debt. His core thesis: agent mistakes accumulate faster than human mistakes, making the 'move fast' approach particularly dangerous in AI-assisted development.

Takeaways

AI agents can generate technical debt faster than human developers, requiring new approaches to code quality control.
The velocity benefits of AI coding tools may come at the cost of long-term code maintainability and team understanding.
Engineering teams need intentional practices to maintain discipline when AI makes rapid development so tempting.

from Mar 29, 2026 · via rss-willison

Intermediate

Show HN: Robust LLM extractor for websites in TypeScript

andrew_zhong

software-engineering how-we-work rag

A practical TypeScript library that solves the common problem of extracting structured data from websites using LLMs, addressing real pain points like HTML noise, token budget management, and brittleness of traditional CSS selectors. This represents the kind of focused tooling that makes AI-powered data extraction reliable enough for production use.

Takeaways

LLM-based extraction needs preprocessing to remove HTML noise and stay within token budgets for reliable results.
Focused tools that solve specific AI integration problems are more valuable than general-purpose solutions for production teams.
AI extraction can replace brittle CSS selectors but requires thoughtful engineering to handle edge cases and failures.

from Mar 29, 2026 · 72 points on HN · via api-hn

Intermediate

From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI

software-engineering how-we-work foundational opinion

As AI generates code faster than teams can understand it, traditional technical debt isn't the only concern — cognitive debt (team understanding erosion) and intent debt (missing rationale for decisions) become critical risks. This framework challenges teams to think beyond code quality and consider how AI affects shared understanding and knowledge capture. Essential reading for engineering leaders navigating the balance between AI velocity and long-term maintainability.

Takeaways

AI-generated code creates new forms of debt beyond traditional technical debt that can silently undermine team effectiveness.
Cognitive debt occurs when team understanding erodes faster than code accumulates, making future changes increasingly risky.
Intent debt — the absence of captured rationale — becomes critical when both humans and AI agents need to work safely with existing code.

from Mar 29, 2026 · via suggestion

Intermediate

Pi: The Minimal Agent Within OpenClaw

agents software-engineering how-we-work

Pi represents a minimalist approach to coding agents that focuses on doing fewer things extremely well rather than trying to be a general-purpose assistant. The author argues this constraint-driven design offers a glimpse into how production coding agents should be built — with clear boundaries and specific capabilities rather than attempting to solve every development task.

Takeaways

Minimalist agent design with clear constraints may be more effective than general-purpose coding assistants.
Focused agents that excel at specific tasks could be the future of AI-assisted development workflows.

from Mar 29, 2026 · via suggestion

Accessible

Coding agents for data analysis

agents software-engineering how-we-work

Comprehensive workshop content demonstrating practical applications of coding agents for data analysis workflows. Covers real-world use cases like database querying, data exploration, and cleaning tasks using Claude Code and OpenAI Codex. Extremely valuable for engineers building data analysis pipelines with LLMs, providing concrete examples and methodologies rather than theoretical frameworks.

Takeaways

Coding agents excel at automating data analysis workflows including database querying, exploration, and cleaning tasks.
Claude Code and OpenAI Codex provide practical frameworks for building data analysis pipelines with concrete implementation examples.
Workshop-style learning with real use cases is more valuable than theoretical frameworks for implementing coding agents.

from Mar 23, 2026 · via rss-willison

Intermediate

An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Ravish Gupta

agents security how-we-work

Demonstrates a production-ready multi-agent architecture that cuts cybersecurity risk assessment costs from $15,000 to near-zero while maintaining 85% agreement with certified practitioners. The six-agent system uses persistent shared context to build comprehensive assessments in under 15 minutes. This is an excellent blueprint for building multi-agent systems that tackle expensive professional services.

Takeaways

A six-agent architecture reduced cybersecurity risk assessment costs from $15,000 to near-zero while maintaining 85% agreement with certified practitioners.
Multi-agent systems with persistent shared context can complete complex professional assessments in under 15 minutes.
This architecture provides a blueprint for replacing expensive professional services with coordinated AI agents.

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20131

Accessible

Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Maximiliano Armesto

software-engineering agents how-we-work

A rare longitudinal field study tracking real software modernization projects using human-AI collaboration across three major migrations. Shows concrete metrics: portfolio delivery time dropped from 36 project-weeks to 9.3, with modeled person-day savings of 73%. This provides actual evidence for AI productivity claims in enterprise software delivery, not just individual task benchmarks.

Takeaways

Real software modernization projects using human-AI collaboration reduced delivery time from 36 project-weeks to 9.3 with 73% person-day savings.
This provides concrete evidence for AI productivity claims in enterprise software delivery beyond individual task benchmarks.
Successful human-AI collaboration in software delivery requires orchestrated workflows, not just individual AI tool adoption.

from Mar 23, 2026 · via api-arxiv · arXiv:2603.20028

Accessible

Ask HN: AI productivity gains – do you fire devs or build better products?

Bleiglanz

how-we-work software-engineering opinion

A candid Hacker News discussion on the real productivity impacts of AI coding tools, moving beyond hype to practical experience. The author reports massive gains for boilerplate, libraries, and refactoring work while questioning long-term claims for complex enterprise systems. Valuable for understanding the actual developer experience and managing realistic expectations about AI-assisted development.

Takeaways

AI coding tools show massive productivity gains for boilerplate, libraries, and refactoring work but mixed results for complex enterprise systems.
Managing realistic expectations about AI-assisted development requires understanding the gap between hype and practical developer experience.
Teams should focus AI adoption on well-defined, repetitive coding tasks rather than complex architectural decisions.

from Mar 23, 2026 · via api-hn