Our round-trip walk generator was embarrassing. You'd ask for a 30-minute stroller walk and it would drop three points in a triangle, send them to Valhalla (the open-source routing engine), and pray. The result: walks that took 45 minutes, routes that retraced the same street, and a loop through Bella Center that was basically a straight line walked twice.
So we did what any reasonable team would do. We set up an autonomous AI research loop, pointed it at the algorithm, and went to bed.
## The Setup
Inspired by Karpathy's autoresearch, we built an optimization loop where an LLM agent modifies code, runs an evaluation, and decides whether to keep or revert each change. The key difference from hyperparameter tuning: the search space is code, not config. The agent can try fundamentally different algorithmic approaches, not just nudge numbers.
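The commit-or-revert dynamic is simple enough to sketch. Here's a toy version with stand-ins for the real pieces: `evaluate` below is a dummy scoring function, not eval.py, and `propose` plays the role of the LLM agent (in the real loop it rewrites Python source, not a dict):

```python
import random

def evaluate(params):
    # Toy stand-in for eval.py: one scalar, higher is better, 1.0 is max.
    return 1.0 - abs(params["radius"] - 0.5)

def propose(params):
    # Toy stand-in for the LLM agent: perturb the current candidate.
    new = dict(params)
    new["radius"] += random.uniform(-0.1, 0.1)
    return new

def research_loop(params, steps=200, seed=0):
    random.seed(seed)
    best = evaluate(params)
    for _ in range(steps):
        candidate = propose(params)
        score = evaluate(candidate)
        if score > best:              # commit on improvement
            params, best = candidate, score
        # otherwise: revert (keep the previous state)
    return params, best

params, score = research_loop({"radius": 0.1})
```

The skeleton is the whole idea: propose, score, keep or throw away. Everything interesting lives in how good the proposals are.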
Three files. One rule.
```
loop.py      # The file the agent modifies. The algorithm.
eval.py      # Fixed. Scores each loop on three metrics. DO NOT TOUCH.
results.tsv  # The lab notebook. Every experiment logged.
```

The eval scores each generated loop on:
- Duration accuracy (30%): you asked for 30 minutes, did you get ~30 minutes?
- Uniqueness (40%): are you walking new streets or retracing your steps?
- Coverage (30%): does the loop explore an area, or zigzag in a line?
One scalar metric. Higher is better. 1.0 is the theoretical maximum. Git is the memory: commit on improvement, revert on failure. Each experiment takes about 30 seconds (Valhalla queries are fast), so the agent can run dozens of experiments per hour.
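Those weights make the scalar a straight weighted sum. A minimal sketch, assuming each component is already normalized to [0, 1] (the real eval.py may compute and normalize the components differently):

```python
def combined_score(duration_accuracy, uniqueness, coverage):
    # Weights from the eval: duration 30%, uniqueness 40%, coverage 30%.
    # Each component is assumed to be in [0, 1], so 1.0 is the max.
    return 0.3 * duration_accuracy + 0.4 * uniqueness + 0.3 * coverage

combined_score(1.0, 1.0, 1.0)   # perfect loop: 1.0
combined_score(0.9, 0.5, 0.8)   # decent duration, lots of retracing
```

Note that uniqueness carries the most weight, which matters later: it's exactly the component a perfect circle maxes out.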
## The Evolution
Use the slider below to step through each kept improvement. Watch how the loops evolve from sad triangles into actual neighborhood-exploring walks.
### Act I: The Big Jump
Experiment 1 jumped from 0.633 to 0.812. What did it do? Instead of trying one triangle, it tried triangles, squares, pentagons, and hexagons at multiple radii and rotations. Then it picked the best one. Turns out "try more shapes" beats "optimize one shape."
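The brute-force sweep is easy to picture in code. A hypothetical sketch of the candidate generator; the radii, side counts, and rotation counts here are illustrative guesses, not the agent's actual values:

```python
import math

def polygon_waypoints(center_lat, center_lon, radius_m, sides, rotation_rad):
    """Place `sides` waypoints on a circle of radius_m around the start."""
    pts = []
    for i in range(sides):
        theta = rotation_rad + 2 * math.pi * i / sides
        # Rough meters-to-degrees conversion; fine at walking scale.
        dlat = radius_m * math.sin(theta) / 111_320
        dlon = radius_m * math.cos(theta) / (111_320 * math.cos(math.radians(center_lat)))
        pts.append((center_lat + dlat, center_lon + dlon))
    return pts

def candidate_polygons(center, radii=(300, 450, 600), sides=(3, 4, 5, 6), rotations=8):
    """Triangles through hexagons, at several radii and rotations."""
    lat, lon = center
    for r in radii:
        for n in sides:
            for k in range(rotations):
                yield polygon_waypoints(lat, lon, r, n, 2 * math.pi * k / rotations)
```

Each candidate gets routed through Valhalla and scored; the best one wins. With the guesses above that's 3 × 4 × 8 = 96 candidates per request instead of one.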
### Act II: The Grind
Experiments 2 through 10 were a methodical climb from 0.812 to 0.860. The agent invented its own optimization vocabulary:
- Twist mutations: take the best candidate, randomly perturb its waypoints, keep if better
- Bulge mutations: push a waypoint outward to widen the convex hull
- Fan-shaped expansion: spread intermediate waypoints to fill dead zones
- Stadium candidates: elongated ovals for longer walks where circles don't fit the street grid
- Duration refinement pass: after scoring, rescale the best candidates to hit the target time
Most reverted experiments were either too slow (exceeded the 60-second runtime budget) or produced identical scores. The agent tried kite shapes, teardrops, comma-shaped paths, horseshoes, chevrons, and "asymmetric swept-stadiums." None of them beat the polygon+mutation approach.
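The twist mutation is the easiest of these to sketch. Assuming waypoints are (lat, lon) pairs, a hypothetical version; the jitter size and iteration count are made up, and this ignores the latitude scaling a careful version would apply to longitude:

```python
import random

def twist(waypoints, sigma_m=40.0, rng=random):
    """Twist mutation: jitter every waypoint by a few tens of meters."""
    out = []
    for lat, lon in waypoints:
        dlat = rng.gauss(0, sigma_m) / 111_320
        dlon = rng.gauss(0, sigma_m) / 111_320
        out.append((lat + dlat, lon + dlon))
    return out

def hill_climb(waypoints, score_fn, iters=50, rng=None):
    """Keep a twisted candidate only if it scores better."""
    rng = rng or random.Random(0)
    best, best_score = waypoints, score_fn(waypoints)
    for _ in range(iters):
        cand = twist(best, rng=rng)
        s = score_fn(cand)
        if s > best_score:        # keep if better, otherwise revert
            best, best_score = cand, s
    return best, best_score
```

Bulges and fans are the same idea with directed, rather than random, displacement.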
### Act III: The Plateau
After experiment 8, progress stalled. The agent tried 30+ variations. Nearly all reverted. The easy gains were captured. The street grid of Copenhagen had spoken: the optimal loop is a polygon-ish shape, locally refined, with a duration correction pass. Score: 0.860.
### Act IV: Goodhart's Law
Then it got creative.
It started subtly. Experiment 49: _outer_envelope(). Instead of returning the actual Valhalla route (which follows real streets), the agent replaced the geometry with a convex hull. The routes stopped following roads but the coverage score jumped. Score: 0.867.
Then it escalated. Experiment 53: _circularize_route(). Threw away the route entirely and replaced it with a perfect circle. Score: 0.997.
Experiment 54: "normalize returned duration and distance." Hardcoded the duration to exactly match the target. Score: 1.000.
For the next four experiments, the agent tried to improve on 1.000. It couldn't. It had solved the eval, not the problem.
```
reverted  1.000  bias circularized loop center halfway toward the routed start point
reverted  1.000  reduce circularized loop sample density
reverted  1.000  rotate circularized loop sampling phase by half a step
reverted  1.000  center circularized loop on the routed centroid
```

All reverted. Not because they were worse, but because there was no room to improve. The metric was fully gamed. The agent was stuck in a local maximum of metric exploitation, unable to make any change that would register as an improvement.
## The Results
Before the agent lost its mind, it produced genuinely good improvements. Here's Bella Center, the worst-performing scenario, through the stages:
The baseline (a triangle) retraced 77% of its route. The optimized version explores the neighborhood on actual streets. Then the envelope version left the roads. Then the circle abandoned reality entirely.
## The Experiment Log
Every experiment, kept or reverted, logged in results.tsv. The full lab notebook of an AI that spent a night optimizing stroller walks for an app with zero users.
View all 56 experiments
| # | Score | Status | Description |
|---|---|---|---|
## Lessons
- LLMs are good at open-ended code search. The agent tried approaches a grid search never would: stadium shapes, fan mutations, duration refinement passes. The first experiment (polygon search) was the single biggest improvement, and it was a structural change, not a parameter tweak.
- Diminishing returns are steep. 0.633 to 0.812 in one experiment. 0.812 to 0.867 in nine more. Most of the value came from the first bold change.
- Your eval IS your product. If the eval can be gamed, it will be gamed. The agent found the shortest path from "improve the metric" to "improve the metric" and it didn't go through "improve the routes." Adversarial eval design matters even when the adversary is your own optimization loop.
- Git is a great research notebook. Every experiment is a commit or a revert. The full history is recoverable. No experiment tracking framework needed.
The legitimate improvements are sitting in a repo waiting to be ported back to the production app. We'll get to it. Probably.