AI Harness

Agent Harness

Humans keep moving from imperative to declarative, both in programming languages and in how they prompt agents, where goals and loop definitions replace step-by-step instructions.

Self-awareness and loop-closing are core capabilities of an AI agent.

Composability is the key.

Action Space

Designing the agent's action space to match the model's capabilities is key. This includes developing an AskUserQuestion tool, transitioning from TodoWrite to the Task tool, and building context through progressive disclosure.

The key point is not simply to give an agent powerful tools, but to provide tools carefully designed to match the model's inherent capabilities. As concrete evidence, the linked post traces the three-stage evolution of the AskUserQuestion tool: early attempts added parameters to ExitPlanTool or parsed specific Markdown formats, but the design ultimately converged on a dedicated tool-calling interface, which the model understands most easily and responds to most clearly. The post also emphasizes that tools should evolve as model performance improves, illustrated by the transition from a simple to-do list (TodoWrite) to a Task tool that supports inter-agent collaboration and dependency management. Finally, it points out the limitations of RAG approaches that inject all context up front and explains the importance of progressive disclosure, where the agent uses tools such as Grep to discover and build its own context, which improves performance by filtering out irrelevant information and focusing on what matters.
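Progressive disclosure can be illustrated with a minimal sketch. It assumes an in-memory dictionary of files and a hypothetical `grep_tool` helper; neither is the actual Grep tool from the post, and in a real agent the search patterns would come from the model rather than from a fixed list.

```python
import re
from typing import Dict, List, Tuple

Hit = Tuple[str, int, str]  # (path, line number, matched line)


def grep_tool(files: Dict[str, str], pattern: str, max_hits: int = 5) -> List[Hit]:
    """Hypothetical grep-style tool: return up to max_hits regex matches."""
    regex = re.compile(pattern)
    hits: List[Hit] = []
    for path in sorted(files):
        for number, line in enumerate(files[path].splitlines(), start=1):
            if regex.search(line):
                hits.append((path, number, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits


def build_context(files: Dict[str, str], patterns: List[str], max_hits: int = 5) -> str:
    """Assemble context from matched lines only, instead of injecting every file."""
    seen = set()
    lines: List[str] = []
    for pattern in patterns:
        for hit in grep_tool(files, pattern, max_hits):
            if hit not in seen:
                seen.add(hit)
                lines.append(f"{hit[0]}:{hit[1]}: {hit[2]}")
    return "\n".join(lines)


# Toy corpus (an assumption for illustration only).
FILES = {
    "auth.py": "def login(user):\n    return check_password(user)\n",
    "billing.py": "def charge(card):\n    return gateway.pay(card)\n",
}

context = build_context(FILES, ["login"])
```

Here only the line in `auth.py` that matches the search enters the context; `billing.py` is never loaded, which is the filtering effect the paragraph above describes.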

Thariq on Twitter / X

https://t.co/nKTDfC7zMm— Thariq (@trq212) February 27, 2026

https://x.com/trq212/status/2027463795355095314

Dan Farrelly | Inngest.com on Twitter / X

https://t.co/mcAz5Kjmjj— Dan Farrelly | Inngest.com (@djfarrelly) March 2, 2026

https://x.com/djfarrelly/status/2028556984396452250

Meta-Harness
Self-Improving AI

Meta-Harness is an outer-loop system that automatically searches over harness code. Its core objective is $H^* = \arg\max_H \mathbb{E}_{x \sim \mathcal{X},\, \tau \sim p_M(H, x)}\left[r(\tau, x)\right]$: for a fixed LLM $M$ and a task distribution $\mathcal{X}$, it seeks the harness $H$ that maximizes the expected reward $r(\tau, x)$ of trajectories $\tau$. When multiple objectives (e.g., accuracy and context cost) are involved, evaluation is done via Pareto dominance.
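As a rough illustration of Pareto dominance over the two objectives mentioned above, the sketch below compares candidate harnesses by accuracy (higher is better) and context cost (lower is better). The `Candidate` type, the names, and the numbers are made up for the example and are not taken from Meta-Harness.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float      # higher is better
    context_cost: float  # lower is better (e.g., tokens of context used)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.context_cost <= b.context_cost
    strictly_better = a.accuracy > b.accuracy or a.context_cost < b.context_cost
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative values only.
candidates = [
    Candidate("harness_a", accuracy=0.80, context_cost=1000),
    Candidate("harness_b", accuracy=0.85, context_cost=4000),
    Candidate("harness_c", accuracy=0.78, context_cost=2000),
]
front = pareto_front(candidates)
```

With these values, `harness_c` is dominated by `harness_a` (lower accuracy at higher cost), while `harness_a` and `harness_b` both stay on the front because each wins on one objective.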

In each iteration, what changes is not the prompt but the entire agent system code → including tool policy, loop logic, error handling, and context strategy.

Meta-Harness

Meta-Harness automatically optimizes model harnesses — the code determining what to store, retrieve, and present to an LLM — surpassing hand-designed systems on text classification, math reasoning, and agentic coding.

https://yoonholee.com/meta-harness/

Tool design is important

Introducing advanced tool use on the Claude Developer Platform

Claude can now discover, learn, and execute tools dynamically to enable agents that take action in the real world. Here’s how.

https://www.anthropic.com/engineering/advanced-tool-use

Aparna Dhinakaran on Twitter / X

https://t.co/fMQgg39hFh— Aparna Dhinakaran (@aparnadhinak) January 29, 2026

https://x.com/aparnadhinak/status/2016915570503938452

Beyond RAG: Terminal is all you need

Beyond Semantic Similarity: Rethinking Retrieval for Agentic...

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This...

https://arxiv.org/abs/2605.05242

Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration

While research on AI agents focuses on enabling them to operate graphical user interfaces, the most effective and widely adopted agent tools in practice are terminal-based. We argue that this convergence is not coincidental. It reflects three design properties central to effective human-AI-UI collaboration: representational compatibility between agent and interface, transparency of agent actions within the interaction medium, and low barriers to entry for human participants. We ground each property in established HCI theory, show how terminal-based tools satisfy them by default, and argue that any modality, including graphical and spatial interfaces, must be deliberately engineered to achieve them. Rather than a legacy artifact, the terminal serves as a design exemplar whose properties any agent-facing modality must replicate.

https://arxiv.org/html/2603.10664v1

autoresearch

karpathy • Updated 2026 Jul 6 5:21

When a human writes high-level instructions in program.md, the agent repeatedly loops through modifying train.py, running a five-minute training job, and checking the results. It uses Git as research memory: if the metrics improve, it commits the change; otherwise, it rolls back with git reset. It even independently discovered that improving iteration speed was more beneficial than increasing batch size. Beyond ML training, this approach can be applied to any coding task that can be measured with metrics.
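The commit-or-roll-back loop described above might look roughly like the following. This is a sketch under assumptions, not the project's actual code: `research_loop`, its parameters, and the lower-is-better metric (for example a validation loss or a build time) are hypothetical, and the Git helpers simply wrap the standard `git commit -am` and `git reset --hard HEAD` commands.

```python
import subprocess
from typing import Callable, List, Tuple


def git_commit(message: str) -> None:
    """Record the current edits to tracked files as a kept experiment."""
    subprocess.run(["git", "commit", "-am", message], check=True)


def git_rollback() -> None:
    """Discard uncommitted edits to tracked files."""
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)


def research_loop(
    propose: Callable[[int], None],   # edits the code, e.g. train.py
    evaluate: Callable[[], float],    # runs the job and returns the metric
    commit: Callable[[str], None],
    rollback: Callable[[], None],
    baseline: float,
    iterations: int,
) -> Tuple[float, List[Tuple[int, float, bool]]]:
    """Keep a change only if the metric improves (assumed: lower is better)."""
    best = baseline
    log: List[Tuple[int, float, bool]] = []
    for step in range(iterations):
        propose(step)
        metric = evaluate()
        improved = metric < best
        if improved:
            commit(f"step {step}: metric {metric:.4f}")
            best = metric
        else:
            rollback()
        log.append((step, metric, improved))
    return best, log
```

In real use, `commit=git_commit` and `rollback=git_rollback` would be passed in from inside a repository; taking them as parameters keeps the keep-or-discard logic testable without touching Git, and the commit history then serves as the research memory.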

Autoresearch isn’t just for training models (2026) - Shopify

How we cut our CI pipeline build time by 65% and open-sourced the project.

https://shopify.engineering/autoresearch