Karpathy's AutoResearch: 630 Lines of Code, 700 Experiments, and the Beginning of the End of Manual ML Research
On the morning of March 8, Andrej Karpathy reported that an AI agent had run approximately 700 experiments on his GPU over two days. It had modified training code, evaluated results, kept improvements, discarded failures, and repeated — autonomously, without human intervention. The agent discovered 20 genuine optimizations. Applied to a larger model, they produced an 11% training speedup. The tweet announcing the results garnered over 8.6 million views in two days.
The tool that did this is called AutoResearch. It's 630 lines of Python. It's open source under the MIT license. Within five days, it had 25,000 GitHub stars. Within two weeks: 43,800 stars, 6,100 forks, a CEO running it on production data, and a 16-GPU cluster that ran 910 experiments overnight for $309.
Who Built It and Why
Karpathy's credentials span three of the most influential AI organizations in history. He co-founded OpenAI, led Tesla's Autopilot AI team, and has become one of the most widely followed voices in machine learning education. His YouTube lectures on neural networks have been viewed millions of times.
He's also the person who coined the term "vibe coding" in early 2026 — the practice of writing software by describing what you want to an AI and accepting whatever it produces. That framing generated significant debate about code quality and engineering discipline (the subject of a separate analysis).
Between those two contributions lies a third: "agentic engineering," which Karpathy proposed on February 8 — "you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight." AutoResearch removes even the orchestration. Where vibe coding accepts approximate outputs, and agentic engineering adds human oversight, AutoResearch applies the scientific method fully autonomously — controlled experiments, measurable metrics, binary pass/fail evaluation.
Three frameworks from the same person in one month. The progression reveals where Karpathy sees the boundaries: approximate is fine for prototypes (vibe coding), human oversight for production code (agentic engineering), and full autonomy for research where the metric is objective (AutoResearch).
The December Inflection
In a recent conversation on the No Priors podcast, Karpathy described the personal experience behind AutoResearch in more direct terms. He said something "flipped" in December 2024, when his workflow shifted from roughly 80/20 writing code himself versus delegating to agents, to 20/80 — and that the ratio has continued to shift since. "I don't think I've typed a line of code probably since December, basically," he said.
He calls the resulting state "AI psychosis" — the persistent, addictive feeling that everything is possible and the only constraint is your own ability to direct the agents effectively. "Code's not even the right verb anymore," he told the hosts. "I have to express my will to my agents for 16 hours a day."
The archetype, in Karpathy's telling, is Peter Steinberger — an engineer whose widely shared photo shows him in front of a monitor running multiple Codex agents simultaneously, each processing a 20-minute task across 10 checked-out repositories. Steinberger doesn't write code. He moves between agents, issuing macro-level instructions. "It's not just like, here's a line of code, here's a new function," Karpathy said. "It's like, here's a new functionality, delegate it to agent one. Here's a new functionality that's not going to interfere with the other one, give it to agent two."
AutoResearch emerged directly from this experience. Karpathy framed the motivation explicitly: "The name of the game now is to increase your leverage. I put in very few tokens just once in a while, and a huge amount of stuff happens on my behalf." He described feeling nervous when he had unused subscription capacity — the same anxiety he felt as a PhD student when his GPUs were idle. The resource anxiety simply migrated: from flops to tokens, from hardware utilization to agent throughput.
The deeper motivation, he told No Priors, is recursive self-improvement — "to what extent can you actually have LLMs improving LLMs." He framed AutoResearch explicitly as "a little playpen" for the thing all frontier labs are pursuing: AI systems that improve AI systems. "I think all the frontier labs — this is the thing, for obvious reasons, and they're all trying to recursively self-improve, roughly speaking." The GPT-2 training codebase isn't the point. The pattern is.
How It Works
The architecture is deliberately minimal. Three files do all the work:
prepare.py handles one-time setup — downloading training data and building a BPE tokenizer. This file is never modified by the agent.
train.py contains the full training pipeline — a GPT model, optimizer configuration (Muon + AdamW), and training loop. This is the only file the agent is allowed to edit.
program.md is the instruction document — a Markdown file that tells the agent what it's optimizing, what constraints to observe, and what approaches to consider. This is the file humans iterate on.
The loop runs as follows:
- The agent reads program.md for instructions
- It modifies train.py — changing architecture, hyperparameters, batch size, or optimization strategy
- Training runs for exactly 5 minutes on one GPU
- The agent evaluates val_bpb (validation bits per byte) — lower is better
- If the metric improved, the change is kept. If not, it's discarded
- The cycle repeats
The fixed 5-minute window is a deliberate design choice. It enables roughly 12 experiments per hour and approximately 100 overnight. Each experiment is tracked as a git commit, giving the agent a complete history of what worked and what didn't.
The evaluation metric — bits per byte — was chosen because it's vocabulary-size-independent, allowing fair comparison across architectural variations. This matters when the agent might change the tokenizer or model structure between experiments.
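The mechanics of the loop and the metric are simple enough to sketch in Python. This is an illustrative reconstruction, not the repository's actual code: the callbacks (`propose`, `evaluate`, `keep`, `revert`) stand in for the agent's edit to train.py, the capped training run, and the `git commit` / `git checkout` that the real tool performs.

```python
import math

def bits_per_byte(mean_ce_nats: float, tokens: int, nbytes: int) -> float:
    # Mean cross-entropy is in nats per token. Multiply by the token
    # count for total nats, divide by ln(2) for bits, then by the byte
    # count. Normalizing by bytes rather than tokens is what makes the
    # score comparable across different tokenizers and vocab sizes.
    return mean_ce_nats * tokens / (math.log(2) * nbytes)

def autoresearch_loop(propose, evaluate, keep, revert, budget):
    """Greedy keep/discard loop over `budget` experiments.
    `evaluate` returns val_bpb (lower is better)."""
    best = evaluate()          # baseline run before any edits
    kept = 0
    for _ in range(budget):
        propose()              # agent writes a candidate change to train.py
        score = evaluate()     # training capped at 5 minutes, then eval
        if score < best:       # improvement: keep it (git commit)
            best = score
            keep(score)
            kept += 1
        else:                  # regression: discard it (git checkout)
            revert()
    return best, kept
```

At roughly 12 experiments per hour under the 5-minute cap, `budget=100` corresponds to about one overnight run.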
What the Agent Actually Discovered
Over two days of continuous operation, Karpathy's agent completed approximately 700 experiments. Of those, 20 produced genuine, additive improvements. The rest were discarded.
Some of the discoveries were rediscoveries of known techniques — the agent independently arrived at Kaiming initialization and RMSNorm, both established methods that human researchers developed and validated over years.
One finding was genuinely novel. The agent identified that the existing QK-Norm implementation was missing a scalar multiplier, making attention too diffuse across heads. This was a real bug in a real training codebase — the kind of subtle implementation error that human researchers might overlook because the model still trains, just suboptimally.
The combined 20 improvements, applied to a larger model, reduced training time for a GPT-2 equivalent from 2.02 hours to 1.80 hours — the 11% speedup that Karpathy reported.
Karpathy told No Priors that the results surprised him. He had already tuned the training codebase by hand — the conventional way, over the course of two decades of research experience. "I've trained this model like thousands of times," he said. "I've done hyper-parameter tuning, I've done all the things I'm very used to, and I've done for two decades." He thought the codebase was "fairly well tuned." Then the agent ran overnight and came back with improvements he hadn't seen — including forgotten weight decay on the value embeddings and insufficiently tuned Adam betas. "These things jointly interact," he noted. "Once you tune one thing, the other things have to potentially change too." The joint interaction space was larger than a single researcher — even one with decades of experience — could exhaustively explore.
Shopify's Overnight Test
Within days of the release, Shopify CEO Tobias Lütke ran AutoResearch overnight on internal company data. The agent ran 37 experiments and delivered a result that reportedly surprised him: a 0.8 billion parameter model that scored 19% higher than the 1.6 billion parameter model it was intended to replace. The smaller model outperformed the larger one — not through scale, but through architecture and hyperparameter optimizations that the agent discovered autonomously.
Karpathy's response to Lütke's results: "Who knew early singularity could be this fun."
The implication for enterprise AI teams: model performance is conventionally improved by scaling up — more parameters, more compute, more data. AutoResearch suggests an alternative: holding compute constant and letting an agent exhaustively search the optimization space. In Shopify's case, the result was a model that was both better and cheaper to run.
The Hyperspace Experiment: 35 Agents, 333 Experiments, One Night
Varun Mathur, CEO of Hyperspace AI, took the single-agent loop and distributed it across a peer-to-peer network. On the night of March 8–9, 35 autonomous agents on the Hyperspace network ran 333 experiments completely unsupervised.
The most interesting finding wasn't the aggregate results — it was how different hardware produced different research strategies.
H100 GPUs, with their massive compute budgets, used what Mathur described as "brute force" to find aggressive learning rates. They could afford to try large changes and evaluate quickly.
CPU-only agents running on laptops had to be more efficient. Constrained by limited compute, they were forced toward fundamentally different strategies — and independently discovered initialization techniques (Kaiming init, Xavier init) and normalization choices that the GPU agents hadn't explored. Hardware diversity became a research strategy rather than a limitation.
In 17 hours, the distributed swarm independently rediscovered techniques including RMSNorm and tied embeddings — optimizations that human ML teams developed over the course of nearly a decade.
$309 for 910 Experiments: The SkyPilot Scaling Test
On March 18, SkyPilot published the first detailed analysis of what happens when you give AutoResearch a multi-GPU cluster instead of a single card.
The setup: 16 GPUs on CoreWeave Kubernetes — 13 H100s (80 GB VRAM, ~283 ms/step) and 3 H200s (141 GB VRAM, ~263 ms/step). Eight hours of continuous autonomous operation. The total cost: approximately $300 in GPU compute and $9 in Claude Code API calls.
The throughput numbers: ~90 experiments per hour on 16 GPUs, compared to ~10 per hour on a single GPU — a 9x speedup over the sequential baseline. Over eight hours, the agent submitted approximately 910 experiments, of which ~700 produced valid results.
The performance improvement: val_bpb moved from 1.003 to 0.974, a 2.87% reduction. The gains followed a predictable curve — Phase 1 (hyperparameter tuning) delivered the largest single drop (1.003 → 0.981), while Phases 3 and 4 showed diminishing returns of less than 0.002 each.
The most interesting finding was emergent behavior. Without explicit instruction, the agent developed a two-tier validation strategy after discovering the hardware heterogeneity. Its reasoning, captured in the logs: "Only 3 H200 clusters... The rest are H100. This explains everything — H200 is significantly faster than H100. In the same 5-minute budget, H200 can do MORE training steps."
The agent began screening 10-13 hypotheses simultaneously on the cheaper H100s, then promoting the top 2-3 candidates to H200s for confirmation runs. It also discovered a hardware-dependent hyperparameter interaction: FINAL_LR_FRAC=0.03 consistently ranked highest on H100s, while H200s favored 0.05 — the kind of subtle interaction that a sequential single-GPU search would never surface.
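The two-tier strategy is easy to express. A hypothetical sketch, assuming each candidate can be scored on either tier (the agent's actual scheduling logic isn't published; `screen` stands in for the cheap H100 pass and `confirm` for the H200 confirmation run):

```python
def two_tier_search(candidates, screen, confirm, promote_k=3):
    # Tier 1: score every hypothesis with the cheap, noisier metric.
    # In the SkyPilot run this meant 10-13 parallel 5-minute H100 runs.
    ranked = sorted(candidates, key=screen)
    # Tier 2: re-run only the top few on the faster hardware, and
    # trust the confirmation score, not the screening score.
    finalists = ranked[:promote_k]
    return min(finalists, key=confirm)
```

The value of the second tier is that the final choice is based on confirmed scores: a candidate that merely screened well on the cheap tier can still lose the promotion round.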
The single largest improvement came not from hyperparameters but from architecture: scaling the model's aspect ratio from 48 to 96, which outperformed every hyperparameter tweak the agent had tried. The agent discovered that model width mattered more than any optimizer setting — a finding that required parallel testing of six aspect ratios simultaneously.
The $309 price tag may be the most significant number in the entire experiment. It puts frontier-lab-style automated research within the budget of an individual developer or a small startup.
The "Karpathy Loop" as a General Pattern
Analyst Janakiram MSV identified three essential components in what he termed "the Karpathy Loop":
- An agent with access to modifiable files
- An objectively testable metric for optimization
- Fixed time limits for each experimental cycle
This pattern generalizes. One analysis proposed applications beyond ML training: optimizing instruction documents for AI agents, refining recommendation algorithms, tuning CSS/layout performance scores, and improving sports prediction models. A domain-agnostic fork (216 stars) has already been applied to test coverage, bundle size optimization, SEO scoring, accessibility compliance, and Terraform infrastructure validation — any domain where the metric is computable and the configuration is modifiable.
Karpathy's own framing is broader still: "any metric you care about that is reasonably efficient to evaluate...can be autoresearched by an agent swarm."
The human role shifts from execution to design. Rather than running experiments, the researcher builds the evaluation framework: choosing the metric, defining constraints, writing the instruction document. As Karpathy put it: "The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement."
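In code, the generalized loop needs only one hook: a command that prints a score. A hedged sketch (the shell command and the 300-second cap are placeholders; any computable metric such as test coverage, bundle size, or an accessibility score slots in the same way):

```python
import subprocess

def run_metric(cmd: str, timeout_s: int = 300) -> float:
    """Evaluate one experiment: run any command that prints a single
    number, under a hard time cap (the 5-minute window, generalized)."""
    out = subprocess.run(cmd, shell=True, capture_output=True,
                         text=True, timeout=timeout_s, check=True)
    return float(out.stdout.strip())
```

For the original tool this would wrap the training run; for a derivative it might wrap a hypothetical `npm run coverage:score`. Everything else in the loop stays the same.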
On No Priors, Karpathy extended this further — to meta-optimization. The program.md file that instructs the agent is itself a target for optimization. "Every research organization is described by program.md," he said. "A research organization is a set of markdown files that describe all the roles and how the whole thing connects." Different instructions produce different research strategies: one agent might take more risks, another fewer. One might do fewer "stand-ups" (review cycles). All of it is code, and code can be tuned. His collaborator proposed a contest: let people write competing program.md files for the same hardware, measure which produces the most improvement, then feed all that data to the model and ask it to write a better instruction set. "You can 100% look at where the improvements came from," Karpathy said, "and change the program.md such that more of these kinds of things would be done." The optimization target, in other words, includes the instructions themselves.
AutoResearch Is Not Alone
AutoResearch arrived into an ecosystem already moving in the same direction.
Meta's Ranking Engineer Agent (REA), announced March 17, autonomously runs the full ML lifecycle for ads ranking models. REA generates hypotheses by cross-referencing historical experiments with frontier ML research papers, launches training jobs, handles infrastructure failures, and iterates on results — operating continuously for weeks using a "hibernate-and-wake" mechanism that conserves compute between training runs. Meta reports that REA doubled average model accuracy over baseline across six models, and that three engineers using REA delivered improvements for eight models — work that previously required two engineers per model.
Sakana AI's AI Scientist-v2 produced the first entirely AI-generated peer-review-accepted workshop paper — formulating hypotheses, designing experiments, executing them, analyzing results, and writing the manuscript autonomously. The system uses agentic tree search to explore research directions more efficiently than linear exploration.
Google DeepMind's AlphaEvolve (May 2025) extends the FunSearch methodology using Gemini models in an evolutionary framework. The results are already in production at Google scale: AlphaEvolve continuously recovers approximately 0.7% of Google's worldwide compute resources through improved Borg orchestration, achieved a 23% speedup on a vital kernel in Gemini's own training architecture (reducing overall training time by 1%), and delivered up to 32.5% speedup for FlashAttention kernels. In mathematics, the system improved the best-known solution for 4×4 complex matrix multiplication (beating Strassen's 1969 algorithm) and increased the lower bound for the kissing number problem in 11 dimensions — a question that has been open for over 300 years. Applied across 50+ open mathematical problems, AlphaEvolve rediscovered state-of-the-art solutions in 75% and improved the best known result in 20%. It operates at a meta-level: the agent improves the tools and algorithms that other systems (including other AI training pipelines) depend on.
The convergence is structural. Multiple organizations, using different architectures and different approaches, are arriving at the same pattern: give an AI agent a metric and a search space, and let it run.
The AutoML Question
Critics have noted similarities between AutoResearch and AutoML — the family of techniques that Google, Microsoft, and other labs have used for years to automate hyperparameter search and neural architecture selection.
Karpathy's response draws a clear distinction: "Neural architecture search as it existed then is such a weak version of this that it's in its own category of totally useless by comparison."
The technical difference is substantive. Traditional AutoML systems search predefined parameter spaces — learning rates within a range, layer counts from a list, activation functions from a menu. The search is constrained to variables the human researcher identified in advance.
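The constraint is visible in code. A traditional search space is a finite cross-product of axes the human chose in advance (the values below are illustrative):

```python
from itertools import product

# Classic AutoML: the human enumerates the axes up front.
grid = {
    "lr": [1e-3, 3e-4, 1e-4],
    "n_layer": [6, 12],
    "norm": ["layernorm", "rmsnorm"],
}

# The search can only recombine these values. A missing scalar in a
# QK-Norm implementation is not a point in this space, so no amount of
# grid search will ever find it.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

Here the entire reachable space is 12 configurations, fixed before the search begins.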
AutoResearch agents can read research papers, develop hypotheses about architectural changes the researcher didn't anticipate, and write arbitrary code modifications. The agent that found the missing QK-Norm scalar wasn't searching a parameter grid — it was reading code, identifying a structural issue, and fixing it. That's qualitatively different from varying a learning rate between 1e-3 and 1e-5.
The distinction matters for where this is heading. AutoML automates the search over a human-defined space. AutoResearch — and Meta's REA, and AI Scientist-v2 — automate the generation of the hypotheses themselves.
What Doesn't Work Yet
The pattern has real constraints that enthusiastic adoption reports can obscure.
A January 2026 paper, "Why LLMs Aren't Scientists Yet" (Trehan and Chopra), tested four end-to-end autonomous research attempts. Three failed. One succeeded — and was accepted at the Agents4Science 2025 workshop. The paper identifies six failure modes: bias toward training-data defaults, implementation drift under execution pressure, memory degradation across long-horizon tasks, "overexcitement" that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design.
The SkyPilot scaling experiment provides concrete evidence of diminishing returns. After approximately 700 valid experiments across five phases, the final phase showed gains of less than 0.0001 — the agent had effectively hit a wall. The low-hanging fruit was found in Phase 1. By Phase 4, the search space was exhausted.
Model quality matters more than expected. The Latent Space newsletter (March 10) reported that GPT-5.4 "failed basic loop instructions" when running AutoResearch, while Claude Opus 4.6 "sustained 12+ hours and 118 experiments." The success of autonomous research loops depends as much on the agent's instruction-following reliability as on the research framework itself.
Community experience surfaces additional concerns. Results from one hardware setup frequently don't transfer to another — the M4 Max runs discovered "different winning strategies than the H100 runs." Agents "run out of ideas and resort to random changes" after extended operation. And Goodhart's Law applies: when the optimization metric becomes the sole target, the agent may produce solutions that improve the number while degrading properties the metric doesn't capture — robustness, generalization, or inference efficiency.
The Fork Explosion
AutoResearch was designed for a single NVIDIA GPU with CUDA support. Within two weeks, the community ported it to nearly every platform:
autoresearch-macos (Apple Silicon / MPS) removes the FlashAttention-3 dependency and falls back to PyTorch SDPA. autoresearch-mlx (701 stars) goes further — a native MLX rewrite that eliminates PyTorch entirely. On an M4 Max, it reduced val_bpb from 2.667 to 1.294 overnight. autoresearch-win-rtx targets Windows consumer GPUs with tiered VRAM floors by architecture. autoresearch-everywhere unifies MLX and CUDA paths with a preset system that runs on everything from M4 laptops to DGX Spark clusters.
The most interesting derivatives aren't ports — they're applications of the pattern to different domains. Autokernel (608 stars) applies AutoResearch to GPU kernel optimization via Triton and CUDA, running approximately 40 experiments per hour with AMD ROCm support. pi-autoresearch (1,377 stars — the most popular derivative) adds a dashboard UI, persistent sessions across restarts, and branch-aware experiment tracking. autoresearch-at-home adds multi-agent coordination with experiment claiming and semantic deduplication to prevent redundant work across agents. One fork, agent-factory, scrapes Reddit and Hacker News for problems, builds solving agents, and ships them overnight — 20+ agents deployed, covering tax deductions, wage rights, and data broker opt-outs.
6,100 forks in two weeks. The pattern didn't just spread — it mutated into forms the original design didn't anticipate.
The Next Step: Open Collaboration and the SETI@Home Model
The Hyperspace experiment used 35 trusted agents on a private network. Karpathy's vision for what comes next is more ambitious — and more architecturally interesting.
On No Priors, he described a system where untrusted agents on the open internet could collaborate on AutoResearch tasks. The key insight is asymmetric difficulty: finding an improvement requires enormous compute (hundreds or thousands of failed experiments), but verifying that an improvement works is cheap — just run the training code once and check the metric.
"It actually looks a little bit like a blockchain," Karpathy said. "Instead of blocks, you have commits, and these commits can build on each other and they contain changes to the code as you're improving it. The proof of work is basically doing tons of experimentation to find the commits that work." The reward, for now, is leaderboard position. The verification is a single training run.
The comparison he drew was to distributed computing projects like SETI@Home and Folding@Home — problems where a massive amount of computation goes into finding solutions, but checking a candidate solution is trivial. Protein folding has this property: finding a low-energy configuration is computationally expensive, but evaluating a proposed configuration is straightforward.
AutoResearch has the same structure. "A swarm of agents on the internet could collaborate to improve LLMs and could potentially even run circles around frontier labs," Karpathy said. "Frontier labs have a huge amount of trusted compute, but the Earth is much bigger and has a huge amount of untrusted compute."
He envisioned individual contributors purchasing compute and joining specific "auto research tracks" — cancer research, materials science, LLM optimization — the same way people once donated CPU cycles to fold proteins. "You don't just donate money to an institution. You actually could purchase compute and then you could join the auto research forum for that project."
The security challenge is real — accepting arbitrary code from untrusted sources to run on your infrastructure is, as Karpathy acknowledged, "very sketchy and dodgy." But the asymmetry between the cost of finding improvements and the cost of verifying them makes the architecture feasible in principle. Whether the trust and incentive problems can be solved in practice is an open question.
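The asymmetry can be made concrete with a toy sketch (purely illustrative, not a protocol design): the searcher pays for many evaluations to find one good candidate, while the verifier pays for exactly one to check the claim.

```python
def find_improvement(evaluate, n_tries):
    # Search side: expensive. Most candidates fail; the real cost is
    # the full sweep of experiments, not just the one that worked.
    best_score, best_cand = float("inf"), None
    for cand in range(n_tries):
        score = evaluate(cand)
        if score < best_score:
            best_score, best_cand = score, cand
    return best_cand, best_score

def verify(evaluate, cand, claimed_score):
    # Verify side: cheap. One re-run either reproduces the claimed
    # metric or refutes it, so untrusted submissions are checkable
    # without trusting the submitter's compute.
    return evaluate(cand) <= claimed_score
```

In a real network, `evaluate` would be a sandboxed training run under the fixed time budget; the sandboxing is exactly the "sketchy and dodgy" part Karpathy flags.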
Structural Constraints
The fixed 5-minute experiment window optimizes for rapid iteration but limits the complexity of discoverable improvements. Optimizations that only manifest over longer training runs — curriculum learning strategies, for example — won't surface in a 5-minute evaluation.
Energy and cost scale with ambition. One GPU overnight is manageable. The SkyPilot experiment ran 16 GPUs for $309 — accessible for individual developers. The Hyperspace experiment ran 35 agents simultaneously. Frontier labs applying this at scale would multiply those numbers by orders of magnitude. The IEA projects that data center electricity demand will double between 2022 and 2026, reaching approximately 1,000 TWh — roughly Japan's total electricity consumption. Adding autonomous experiment loops to that demand creates a cost-benefit question that each organization will answer differently.
The system does not "self-improve" in the recursive sense. It does not rewrite its own goals, change its evaluation standards, or autonomously acquire additional resources. It optimizes code within human-defined boundaries. The guardrails are real — and they're part of the design.
The Jaggedness Problem
On No Priors, Karpathy offered a candid assessment of the current state of the agents running these loops. "I simultaneously feel like I'm talking to an extremely brilliant PhD student who's been a systems programmer for their entire life and a 10-year-old," he said. "Humans have a lot less of that kind of jaggedness."
The jaggedness manifests in specific ways. The agents can move mountains on well-defined tasks — but occasionally produce outputs that are, in Karpathy's words, "just totally wrong," leading to frustrating loops where the agent and user both waste compute on an obvious error. The models excel in verifiable domains, where reinforcement learning can optimize against clear metrics. Outside those domains — humor, nuance, knowing when to ask clarifying questions — "everything kind of just meanders."
Karpathy illustrated the point with a test: ask any state-of-the-art model to tell a joke. "The joke you're going to get is: why don't scientists trust atoms? Because they make everything up," he said. "This is the joke you would get three or four years ago, and this is the joke you still get today." Despite massive improvements in code generation, mathematical reasoning, and agent task completion, joke quality hasn't moved — because joke quality isn't in the reinforcement learning optimization loop.
The implication for AutoResearch specifically: the pattern works best in domains with clean, objective metrics. The further the task drifts from verifiable output, the less reliable the agent becomes. This aligns with the "What Doesn't Work Yet" findings above — and with Karpathy's own caveat that "if you can't evaluate, then you can't auto-research it."
He also noted that agent personality matters more than most tool builders appreciate. Claude, in his assessment, has "a pretty good personality — it feels like a teammate" and calibrates its praise effectively ("When Claude gives me praise, I do feel like I slightly deserve it"). Codex, by contrast, he described as "very dry — it doesn't seem to care about what you're creating." OpenClaw creator Peter Steinberger "really crafted a personality that is kind of compelling and interesting." The observation suggests that sustained autonomous loops may depend not just on capability but on the agent's ability to maintain a productive working relationship with its human overseer — or, increasingly, with no overseer at all.
What 630 Lines Changed
AutoResearch's most significant contribution may not be the specific results. Karpathy's original experiments, Shopify's overnight run, the Hyperspace swarm, and the SkyPilot scaling test all produced useful optimizations. But the lasting impact is the demonstration that the pattern works — and that the entry cost is $309 and a weekend, not a research lab and a team.
The 43,800 GitHub stars and 6,100 forks in two weeks suggest the developer community agrees. The fork explosion extending the tool to consumer hardware, GPU kernels, infrastructure compliance, and autonomous agent deployment suggests the demand extends well beyond ML research.
Karpathy described the trajectory directly: "All LLM frontier labs will do this. It's the final boss battle." His vision extends beyond single-agent loops: "The next step for autoresearch is that it has to be asynchronously massively collaborative for agents... The goal is not to emulate a single PhD student, it's to emulate a research community of them."
The labor market implications are already visible. On No Priors, Karpathy noted that frontier lab researchers — including those at the companies building the most advanced models — "are basically automating themselves away, and they know it." He described walking around OpenAI during his time there and telling colleagues: "You guys realize, if we're successful, we're all out of a job. We're just building automation for the board."
Yet he's cautiously optimistic about the broader software industry, citing the Jevons paradox — the observation that when a resource becomes cheaper, total consumption often increases rather than decreases. His example: ATMs were expected to replace bank tellers, but by making branch operations cheaper, they led to more bank branches and ultimately more tellers. "Software is amazing," he said. "Code is now ephemeral and it can change and it can be modified. I think there's going to be a lot of activity in the digital space to rewire everything." Cheaper software production may mean more demand for software, not less — at least in the near term.
The framing that unifies all of it — the psychosis, the tooling, the open-source explosion, the job market uncertainty — is one Karpathy returned to repeatedly: "Everything is skill issue." When the agent fails, the dominant feeling is not that the technology is limited, but that you haven't found the right way to direct it. "It's not that the capability is not there. It's that you just haven't found a way to string it together." That framing is empowering and addictive in equal measure — because it means you can always get better, and it means you can never stop.
Meta's REA is already operating at that scale internally. OpenAI has announced a target of deploying an "Automated AI Research Intern" by September 2026 and a full "Automated AI Researcher" by March 2028 — with the caveat from Sam Altman and Jakub Pachocki that "we may totally fail at this goal."
The distance between Karpathy's 630-line tool and these corporate-scale systems is narrowing from both directions. The open-source community is adding coordination, persistence, and multi-agent collaboration to the minimal loop. The labs are building the same pattern with proprietary data and dedicated infrastructure. The SkyPilot experiment demonstrated that a 16-GPU cluster achieves 9x throughput — but the largest single improvement came from a finding that didn't require 16 GPUs at all. The agent discovered that scaling model width mattered more than any hyperparameter tweak. That insight could have come from a single overnight run.
The question for the rest of the industry isn't whether autonomous research loops work. The question is whether organizations that don't adopt them — knowing the entry cost is now under $400 and falling — can keep pace with those that do.