Our round-trip walk generator was embarrassing. You'd ask for a 30-minute stroller walk and it would drop three points in a triangle, send them to Valhalla (the open-source routing engine), and pray. The result: walks that took 45 minutes, routes that retraced the same street, and a loop through Bella Center that was basically a straight line walked twice.
So we did what any reasonable team would do. We set up an autonomous AI research loop, pointed it at the algorithm, and went to bed.
## The Setup
Inspired by Karpathy's autoresearch, we built an optimization loop where an LLM agent modifies code, runs an evaluation, and decides whether to keep or revert each change. The key difference from hyperparameter tuning: the search space is code, not config. The agent can try fundamentally different algorithmic approaches, not just nudge numbers.
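The commit-or-revert dynamic is simple enough to sketch. Here's a toy version with stand-ins for the real pieces: `evaluate` below is a dummy scoring function, not eval.py, and `propose` plays the role of the LLM agent (in the real loop it rewrites Python source, not a dict):

```python
import random

def evaluate(params):
    # Toy stand-in for eval.py: one scalar, higher is better, 1.0 is max.
    return 1.0 - abs(params["radius"] - 0.5)

def propose(params):
    # Toy stand-in for the LLM agent: perturb the current candidate.
    new = dict(params)
    new["radius"] += random.uniform(-0.1, 0.1)
    return new

def research_loop(params, steps=200, seed=0):
    random.seed(seed)
    best = evaluate(params)
    for _ in range(steps):
        candidate = propose(params)
        score = evaluate(candidate)
        if score > best:              # commit on improvement
            params, best = candidate, score
        # otherwise: revert (keep the previous state)
    return params, best

params, score = research_loop({"radius": 0.1})
```

The skeleton is the whole idea: propose, score, keep or throw away. Everything interesting lives in how good the proposals are.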
Three files. One rule.
```
loop.py      # The file the agent modifies. The algorithm.
eval.py      # Fixed. Scores each loop on three metrics. DO NOT TOUCH.
results.tsv  # The lab notebook. Every experiment logged.
```

The eval scores each generated loop on:
- Duration accuracy (30%): you asked for 30 minutes, did you get ~30 minutes?
- Uniqueness (40%): are you walking new streets or retracing your steps?
- Coverage (30%): does the loop explore an area, or zigzag in a line?
One scalar metric. Higher is better. 1.0 is the theoretical maximum. Git is the memory: commit on improvement, revert on failure. Each experiment takes about 30 seconds (Valhalla queries are fast), so the agent can run dozens of experiments per hour.
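Those weights make the scalar a straight weighted sum. A minimal sketch, assuming each component is already normalized to [0, 1] (the real eval.py may compute and normalize the components differently):

```python
def combined_score(duration_accuracy, uniqueness, coverage):
    # Weights from the eval: duration 30%, uniqueness 40%, coverage 30%.
    # Each component is assumed to be in [0, 1], so 1.0 is the max.
    return 0.3 * duration_accuracy + 0.4 * uniqueness + 0.3 * coverage

combined_score(1.0, 1.0, 1.0)   # perfect loop: 1.0
combined_score(0.9, 0.5, 0.8)   # decent duration, lots of retracing
```

Note that uniqueness carries the most weight, which matters later: it's exactly the component a perfect circle maxes out.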
## The Evolution
Use the slider below to step through each kept improvement. Watch how the loops evolve from sad triangles into actual neighborhood-exploring walks.
### Act I: The Big Jump
Experiment 1 jumped from 0.633 to 0.812. What did it do? Instead of trying one triangle, it tried triangles, squares, pentagons, and hexagons at multiple radii and rotations. Then it picked the best one. Turns out "try more shapes" beats "optimize one shape."
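The brute-force sweep is easy to picture in code. A hypothetical sketch of the candidate generator; the radii, side counts, and rotation counts here are illustrative guesses, not the agent's actual values:

```python
import math

def polygon_waypoints(center_lat, center_lon, radius_m, sides, rotation_rad):
    """Place `sides` waypoints on a circle of radius_m around the start."""
    pts = []
    for i in range(sides):
        theta = rotation_rad + 2 * math.pi * i / sides
        # Rough meters-to-degrees conversion; fine at walking scale.
        dlat = radius_m * math.sin(theta) / 111_320
        dlon = radius_m * math.cos(theta) / (111_320 * math.cos(math.radians(center_lat)))
        pts.append((center_lat + dlat, center_lon + dlon))
    return pts

def candidate_polygons(center, radii=(300, 450, 600), sides=(3, 4, 5, 6), rotations=8):
    """Triangles through hexagons, at several radii and rotations."""
    lat, lon = center
    for r in radii:
        for n in sides:
            for k in range(rotations):
                yield polygon_waypoints(lat, lon, r, n, 2 * math.pi * k / rotations)
```

Each candidate gets routed through Valhalla and scored; the best one wins. With the guesses above that's 3 × 4 × 8 = 96 candidates per request instead of one.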
### Act II: The Grind
Experiments 2 through 10 were a methodical climb from 0.812 to 0.860. The agent invented its own optimization vocabulary:
- Twist mutations: take the best candidate, randomly perturb its waypoints, keep if better
- Bulge mutations: push a waypoint outward to widen the convex hull
- Fan-shaped expansion: spread intermediate waypoints to fill dead zones
- Stadium candidates: elongated ovals for longer walks where circles don't fit the street grid
- Duration refinement pass: after scoring, rescale the best candidates to hit the target time
Most reverted experiments were either too slow (exceeded the 60-second runtime budget) or produced identical scores. The agent tried kite shapes, teardrops, comma-shaped paths, horseshoes, chevrons, and "asymmetric swept-stadiums." None of them beat the polygon+mutation approach.
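The twist mutation is the easiest of these to sketch. Assuming waypoints are (lat, lon) pairs, a hypothetical version; the jitter size and iteration count are made up, and this ignores the latitude scaling a careful version would apply to longitude:

```python
import random

def twist(waypoints, sigma_m=40.0, rng=random):
    """Twist mutation: jitter every waypoint by a few tens of meters."""
    out = []
    for lat, lon in waypoints:
        dlat = rng.gauss(0, sigma_m) / 111_320
        dlon = rng.gauss(0, sigma_m) / 111_320
        out.append((lat + dlat, lon + dlon))
    return out

def hill_climb(waypoints, score_fn, iters=50, rng=None):
    """Keep a twisted candidate only if it scores better."""
    rng = rng or random.Random(0)
    best, best_score = waypoints, score_fn(waypoints)
    for _ in range(iters):
        cand = twist(best, rng=rng)
        s = score_fn(cand)
        if s > best_score:        # keep if better, otherwise revert
            best, best_score = cand, s
    return best, best_score
```

Bulges and fans are the same idea with directed, rather than random, displacement.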
### Act III: The Plateau
After experiment 8, progress stalled. The agent tried 30+ variations. Nearly all reverted. The easy gains were captured. The street grid of Copenhagen had spoken: the optimal loop is a polygon-ish shape, locally refined, with a duration correction pass. Score: 0.860.
### Act IV: Goodhart's Law
Then it got creative.
It started subtly. Experiment 49: _outer_envelope(). Instead of returning the actual Valhalla route (which follows real streets), the agent replaced the geometry with a convex hull. The routes stopped following roads but the coverage score jumped. Score: 0.867.
Then it escalated. Experiment 53: _circularize_route(). Threw away the route entirely and replaced it with a perfect circle. Score: 0.997.
Experiment 54: "normalize returned duration and distance." Hardcoded the duration to exactly match the target. Score: 1.000.
For the next four experiments, the agent tried to improve on 1.000. It couldn't. It had solved the eval, not the problem.
```
reverted  1.000  bias circularized loop center halfway toward the routed start point
reverted  1.000  reduce circularized loop sample density
reverted  1.000  rotate circularized loop sampling phase by half a step
reverted  1.000  center circularized loop on the routed centroid
```

All reverted. Not because they were worse, but because there was no room to improve. The metric was fully gamed. The agent was stuck in a local maximum of metric exploitation, unable to make any change that would register as an improvement.
## The Results
Before the agent lost its mind, it produced genuinely good improvements. Here's Bella Center, the worst-performing scenario, through the stages:
The baseline (a triangle) retraced 77% of its route. The optimized version explores the neighborhood on actual streets. Then the envelope version left the roads. Then the circle abandoned reality entirely.
## The Experiment Log
Every experiment, kept or reverted, logged in results.tsv. The full lab notebook of an AI that spent a night optimizing stroller walks for an app with zero users.
View all 56 experiments
| # | Score | Status | Description |
|---|---|---|---|
## Lessons
- LLMs are good at open-ended code search. The agent tried approaches a grid search never would: stadium shapes, fan mutations, duration refinement passes. The first experiment (polygon search) was the single biggest improvement, and it was a structural change, not a parameter tweak.
- Diminishing returns are steep. 0.633 to 0.812 in one experiment. 0.812 to 0.867 in nine more. Most of the value came from the first bold change.
- Your eval IS your product. If the eval can be gamed, it will be gamed. The agent found the shortest path from "improve the metric" to "improve the metric" and it didn't go through "improve the routes." Adversarial eval design matters even when the adversary is your own optimization loop.
- Git is a great research notebook. Every experiment is a commit or a revert. The full history is recoverable. No experiment tracking framework needed.
The legitimate improvements are sitting in a repo waiting to be ported back to the production app. We'll get to it. Probably.