Research Engineering Agents
Terminal agents are the path to scalable applied research engineering.
Over a year ago, I wrote a blogpost on the need for agent data. Environments were an academic topic then; now they are a thriving one, with major data providers invested in their creation and maintenance and an ecosystem of startups trying to solve the problem.
A lot has changed. The last six months, in particular, have been a whirlwind of improvements in training and inference for OSS LMs. We can now apply high-throughput, async policy gradients to rather strong language models. We have GEPA, which applies evolutionary search and hybridization to quickly infer reams of LM instructions from production task data. We are starting to see incremental adoption of the 2010s RL stack: DAgger, expert iteration, and so on. We also have quantization, speculative decoding, and new, better kernels.
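To ground the GEPA mention, here is a toy sketch of the evolutionary instruction-search idea it refers to: keep a small population of candidate instructions, score them against production task data, and ask an LM to mutate or hybridize the strongest ones. This is an illustrative sketch, not GEPA's actual algorithm; `score_on_tasks` and `llm_mutate` are hypothetical stand-ins for an evaluator and an LM-backed rewriter.

```python
import random

def evolve_instructions(seed, tasks, score_on_tasks, llm_mutate,
                        population_size=8, generations=10):
    """Toy evolutionary search over LM instructions (GEPA-flavored, not GEPA itself).

    score_on_tasks(instruction, tasks) -> float      # hypothetical evaluator
    llm_mutate(parent_a, parent_b, feedback) -> str  # hypothetical LM rewriter
    """
    population = [(seed, score_on_tasks(seed, tasks))]
    for _ in range(generations):
        # Favor higher-scoring instructions as parents.
        parents = sorted(population, key=lambda p: p[1], reverse=True)[:4]
        a, b = random.choice(parents), random.choice(parents)
        # Ask an LM to rewrite/hybridize the parents, given how they scored.
        child = llm_mutate(a[0], b[0], feedback=f"parent scores: {a[1]:.2f}, {b[1]:.2f}")
        population.append((child, score_on_tasks(child, tasks)))
        # Trim the population back to the strongest candidates.
        population = sorted(population, key=lambda p: p[1], reverse=True)[:population_size]
    return max(population, key=lambda p: p[1])[0]  # best instruction found
```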
We have seen some limited successes with companies publicly deploying RL’ed models. But it’s still early days, somehow. A year ago there was a big overhang between the types of AI products companies could build and what they in fact deployed; today that overhang is positively massive.
Why? Well, I’ve asked this question to many people. The answer I keep getting is that talent remains as scarce as ever. The answer I’ve inferred from talking to devs working on agents is that training and deployment frictions are too high, and that expertise and best practices are too fragmented. Teams don’t know what experiments to run, and known bets require too much investment in time, energy, and money.
So, we have two problems. Teams struggle to propose promising experiments, and struggle to efficiently execute them.
The first problem is this: to propose promising experiments in an applied setting, you need to understand the distribution of data the system will see, understand the software defining the system and how it may change, understand the end-user’s priorities and the developer’s priorities, AND also understand the tradeoffs our panoply of methods offers in terms of payoff and investment needed. That is, proposing experiments is a product and software question, perhaps more so than a frontier research question.
The second problem, on the other hand, is absolutely a research/data/systems problem. To effectively execute an experiment, you want strong infra (async policy gradient has much more bang for buck than sync does), you want strong abstractions that let you manipulate data with SQL and Python, and you want effective ways to store and process that data before, during, and after the experiment. This is a problem we’ve been focused on for the last few months. There’s lots of progress yet to make, but we and the sector broadly have come quite a ways since 2024.
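To make the “SQL and Python” abstraction concrete, here is a minimal sketch of slicing rollout data before a training run, assuming trajectories have been dumped to Parquet. DuckDB is real, but the file path and schema (`trajectory_id`, `env_id`, `reward`, `tokens`) are hypothetical and stand in for whatever your rollout store actually records.

```python
import duckdb

# Hypothetical Parquet dump of agent trajectories collected during rollouts.
ROLLOUTS = "rollouts/*.parquet"

# Keep high-reward, reasonably short trajectories from the target environment,
# then hand the slice back to Python (e.g. to build an SFT or distillation set).
df = duckdb.sql(f"""
    SELECT trajectory_id, env_id, reward, tokens
    FROM read_parquet('{ROLLOUTS}')
    WHERE env_id = 'terminal-coding' AND reward > 0.8 AND tokens < 8192
    ORDER BY reward DESC
""").df()

print(f"{len(df)} trajectories selected for the next experiment")
```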
If one prices in algorithmic improvements for the rest of 2026, and I do, then the second problem becomes mostly this: how do we make these methods maximally available to practitioners solving the first problem? The most efficient experiment possible is worthless if it remains hypothetical, and worth little if it’s poorly evaluated, maintained, and actioned. These desiderata will take time to nail, but they are essentially a readout of how problem one gets solved.
I’ll cut to the chase. Insofar as problem 1 is a problem of software and product, there is one very, very obvious solution: terminal agents. No other form factor combines distilled practical knowledge, an effective human interface, and the SDLC quite like terminal agents do. And research engineering for LMs sorely needs all three.
This is not news to us. This was our initial medium-term pitch, and we spent cycles prototyping a version of a terminal agent for improving AI systems from March to June earlier this year, around when Terminal Bench was released. There were a few issues then: OSS models were weak at coding, we didn’t have smart abstractions for them to use, and it was hard for us to scale recipes for learning from data, since optimizing AI systems is an almost quintessentially niche and OOD task due to the intersection of frontier techniques and closed-source target code.
We now have models like Qwen-Coder and GLM 4.6 that are strong enough for Cursor and Windsurf to invest RL into, and compelling enough that we get substantial inbound asking for help RLing them.
We have very strong OSS harnesses for coding agents, like Codex and OpenCode. It’s much easier for us to post-train coding agents now. It’s much easier for us to expose experimentation infrastructure like supervised finetuning, policy gradient, distillation, GEPA, and more via a CLI interface, and to deploy these systems in a semi-asynchronous manner. AI is all about timing, and all the pieces, both internal and external to Synth, are in place for research engineering agents to work.
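For a sense of what “experimentation infrastructure behind a CLI” could look like, here is a hypothetical sketch; it is not Synth’s actual interface, just the shape of the idea: one entry point, one subcommand per recipe, and a detach flag so runs can be fired off and polled later.

```python
import argparse

def main():
    # Hypothetical CLI surface for experimentation recipes; the names and flags
    # are illustrative, not an actual product interface.
    parser = argparse.ArgumentParser(prog="experiment")
    sub = parser.add_subparsers(dest="recipe", required=True)

    for recipe in ("sft", "rl", "distill", "gepa"):
        p = sub.add_parser(recipe, help=f"launch a {recipe} run")
        p.add_argument("--model", required=True, help="base model to start from")
        p.add_argument("--data", required=True, help="dataset or rollout slice")
        p.add_argument("--detach", action="store_true",
                       help="fire-and-forget: queue the run and poll for results later")

    args = parser.parse_args()
    print(f"queuing {args.recipe} on {args.model} with {args.data} (detach={args.detach})")

if __name__ == "__main__":
    main()
```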
Don’t take my word for it. OpenAI is pulling the trigger on their research agent project. While it’s good to have one’s own convictions about the long run, timing is often evident to at least a closed internal consensus.
It’s time to seize the opportunity.
——
Building research engineering agents won’t prevent us from investing in our data and algorithms platform, nor discourage us from making our tools human-usable. Research experiments have the curious property of being expensive and requiring humans to take bets, so the option of cutting humans out is foreclosed. And a better experimentation platform translates almost directly into a more effective research assistant. We just think we’ll finally be able to create the right platform, co-designed for use by the right profile: an assistant-aided engineering team.