I've been messing about with a project called AssetForge (project page), an attempt to get Claude to generate pixel art sprites through geometric drawing commands. The sprites are for retro games targeting classic Mac System 6, 7, 8, and 9. The generation itself turned out to be hit and miss, but building the evaluation harness to measure sprite quality taught me a lot about how LLM judges actually work. And along the way I ended up with some genuinely useful tools for creating PICT resources that run on real vintage hardware. If you just want to see some pixel art, there's a full sprite gallery with everything the harness produced.
# What is an evaluation harness
An evaluation harness is a system for measuring the quality of LLM output in a repeatable way. Instead of eyeballing results and thinking "yeah, that looks alright," you define specific quality dimensions, score them against a rubric, and track how those scores change as you tweak your prompts and pipeline.
The basic idea: you have a generator (the LLM producing output) and a judge (another LLM scoring that output against defined criteria). You run the generator on a set of test prompts, the judge scores each result, and you get back a report card. Change something in your pipeline, run it again, compare the numbers. Anthropic have a good guide on defining success criteria and building evaluations, and their engineering blog covers evals for AI agents in more depth.
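The generate-then-judge loop can be sketched in a few lines; `generate` and `judge` here are placeholder stand-ins for the two model calls, not the project's actual code:

```python
# Minimal sketch of a generate-then-judge eval loop.
# `generate` and `judge` are stand-ins for the two LLM calls.

def generate(prompt: str) -> str:
    # Placeholder generator: in the real harness this is an LLM call.
    return f"output for {prompt}"

def judge(prompt: str, output: str) -> int:
    # Placeholder judge: scores 1-5 against a rubric.
    return 4

def run_eval(prompts: list[str]) -> float:
    # Score every prompt and return the average: the "report card".
    scores = [judge(p, generate(p)) for p in prompts]
    return sum(scores) / len(scores)

# Rerun after each pipeline change and compare the averages.
average = run_eval(["a tank sprite", "a dragon sprite"])
```

The whole point is that `run_eval` is cheap to rerun, so every pipeline change gets a number rather than a gut feeling.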
The catch is that your judge needs to be trustworthy. If the judge is handing out 5/5 to everything, the numbers are meaningless. So before you can use the harness to improve your generator, you first need to calibrate the judge against human ratings. That calibration process was most of the work.
# Why I needed one
AssetForge generates pixel art sprites by asking Claude to produce geometric drawing commands (rectangles, circles, polygons) which get rasterised onto a grid. The output is a JSON object with a colour palette and a 2D array of palette indices. Think of it as the LLM writing painting instructions rather than painting directly.
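Concretely, the output shape looks something like this (the field names are illustrative, not the project's actual schema):

```json
{
  "palette": ["#000000", "#2a6e2a", "#7ac57a"],
  "grid": [
    [0, 1, 1, 0],
    [1, 2, 2, 1],
    [1, 2, 2, 1],
    [0, 1, 1, 0]
  ]
}
```

Each number in `grid` indexes into `palette`, so the sprite is fully determined by the two fields together.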
Early on I'd built a tank game prototype and the sprites looked decent. Chunky tanks, ground tiles, obstacles. But when I tried to expand to more complex subjects (dragons, knights, vehicles from different angles) quality was inconsistent. Some sprites were great, others were unrecognisable blobs. I needed a way to measure quality systematically so I could figure out what to improve and whether changes actually helped.
# How the harness works
The pipeline runs like this:
- Load a set of test prompts (30 in the final version, covering vehicles, characters, items, terrain, obstacles, and creatures)
- For each prompt, call Claude Sonnet to generate drawing commands and a colour palette
- Rasterise the commands into a pixel grid using a custom rasteriser that supports rectangles, circles, ellipses, lines, polygons, and flood fills
- Render a 4x scaled PNG of the result for the judge to look at
- Score the sprite on 6 quality dimensions using Claude Opus as judge
- Output a JSON report with per-sprite scores and overall averages
The main evaluation loop handles all of this, including multi-variant generation where it creates several versions of each sprite and picks the best one.
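The rasterisation step above is the most mechanical part. A minimal sketch of a command rasteriser, covering just rectangles and circles (the command field names are illustrative; the real rasteriser also handles ellipses, lines, polygons, and flood fills):

```python
# Sketch of a command rasteriser: geometric commands onto an indexed grid.
# Cell values are palette indices; 0 is treated as background.

def rasterise(commands: list[dict], width: int, height: int) -> list[list[int]]:
    grid = [[0] * width for _ in range(height)]
    for cmd in commands:
        if cmd["op"] == "rect":
            x, y, w, h, c = cmd["x"], cmd["y"], cmd["w"], cmd["h"], cmd["colour"]
            for yy in range(max(0, y), min(height, y + h)):
                for xx in range(max(0, x), min(width, x + w)):
                    grid[yy][xx] = c
        elif cmd["op"] == "circle":
            cx, cy, r, c = cmd["cx"], cmd["cy"], cmd["r"], cmd["colour"]
            for yy in range(max(0, cy - r), min(height, cy + r + 1)):
                for xx in range(max(0, cx - r), min(width, cx + r + 1)):
                    # Filled circle: keep cells within r of the centre.
                    if (xx - cx) ** 2 + (yy - cy) ** 2 <= r * r:
                        grid[yy][xx] = c
    return grid

grid = rasterise([{"op": "rect", "x": 1, "y": 1, "w": 2, "h": 2, "colour": 3}], 4, 4)
```

Commands are applied in order, so later shapes paint over earlier ones, which is what lets the model build up a sprite in layers.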
Each sprite gets scored 1-5 on six quality dimensions, defined in the judge's rubric:
- Component Separation: are the distinct parts (tracks, hull, turret) visually separate?
- Colour Usage: do the colours create readable depth and contrast?
- Detail Density: how rich is the visual detail? (This one was code-based, counting drawing commands)
- Spatial Coverage: does the sprite fill the grid appropriately for its subject?
- Pixel Art Discipline: clean edges, intentional pixel placement, no artifacts?
- Prompt Adherence: does the result actually look like what was asked for?
Each dimension is judged in a separate LLM call running in parallel, with its own rubric and scoring anchors. This mattered because it meant I could debug and fix one dimension at a time without affecting the others.
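The per-dimension fan-out can be sketched like this, with a thread pool standing in for the parallel LLM calls (`score_dimension` is a placeholder for the real Opus judge call):

```python
# Sketch of per-dimension judging: one call per dimension, run in parallel.
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = [
    "component_separation", "colour_usage", "detail_density",
    "spatial_coverage", "pixel_art_discipline", "prompt_adherence",
]

def score_dimension(dimension: str, sprite_png: bytes) -> int:
    # Placeholder: the real call sends the dimension's own rubric
    # and scoring anchors to the judge model.
    return 3

def judge_sprite(sprite_png: bytes) -> dict[str, int]:
    with ThreadPoolExecutor() as pool:
        futures = {d: pool.submit(score_dimension, d, sprite_png) for d in DIMENSIONS}
        return {d: f.result() for d, f in futures.items()}

scores = judge_sprite(b"")
```

Because each dimension is an independent call with its own rubric, swapping one rubric out is a one-file change that cannot perturb the other five scores.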
# Calibrating the judge
This was the real work. Five rounds of calibration, each one identifying a root cause and fixing it.
Round 1: The judge is broken. Ten sprites, Sonnet generating and Sonnet judging. Human average: 3.8/5. LLM average: 4.9/5. The judge gave 5/5 for prompt adherence to 9 of 10 sprites. A full point of leniency bias across every dimension. The judge was reading the drawing command intent ("this rectangle is a tank hull") rather than evaluating the visual result.
Round 2: Rewrite the rubrics. Same 10 sprites, updated rubrics. I added explicit language like "judge the pixels, not the code" and "a person unfamiliar with the prompt should be able to identify the subject." Prompt adherence improved from 1.30 to 0.80 MAD (mean absolute difference from human scores). But spatial coverage got worse because code-based pixel counting doesn't match how humans judge composition.
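MAD here is just the mean absolute difference between the LLM's scores and the human scores for one dimension across the calibration set:

```python
# Mean absolute difference between LLM and human scores for one dimension.
# 0 would be perfect agreement; 1.0 means the judge is off by a full
# point on average.

def mad(llm_scores: list[int], human_scores: list[int]) -> float:
    assert len(llm_scores) == len(human_scores)
    diffs = [abs(l - h) for l, h in zip(llm_scores, human_scores)]
    return sum(diffs) / len(diffs)

mad([5, 5, 4], [4, 3, 4])  # → 1.0
```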
Round 3: Separate the models. Twenty sprites, Claude Opus as judge instead of Sonnet. Anthropic's recommendation is not to use the same model to evaluate its own output. The hypothesis was that Sonnet was recognising its own "handwriting" in the drawing commands. Bias shrank but two dimensions stayed stubborn.
Round 4: Give the judge vision. This was the biggest single improvement. Up to this point, the judge had never actually seen the rendered sprite. It was scoring based on the JSON drawing commands alone, understanding the intent of each rectangle but never seeing what the pixels actually looked like when composed together.
I added three things: the rendered PNG sent as vision content, an ASCII pixel grid as text, and pre-check exercises. For prompt adherence, the judge now had to write down what it saw in the image before looking at the prompt, a "first impression" test. Overall MAD dropped to 0.68.
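A vision judge message in the shape of the Anthropic Messages API content blocks (base64 image plus text) looks roughly like this; the prompt wording and function name are illustrative, not the harness's actual code:

```python
# Sketch of a judge message combining a rendered PNG (as a base64 image
# block) with an ASCII grid and a "first impression" pre-check instruction.
import base64

def build_judge_message(png_bytes: bytes, ascii_grid: str) -> list[dict]:
    return [{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(png_bytes).decode("ascii")}},
            {"type": "text",
             "text": "First, describe what you see in this image without "
                     "reference to any prompt. Only after writing that "
                     "description will you be shown the original prompt "
                     "and asked to score adherence 1-5.\n\n"
                     "ASCII pixel grid:\n" + ascii_grid},
        ],
    }]

msg = build_judge_message(b"\x89PNG", "..0..\n.000.")
```

Forcing the description before revealing the prompt is what makes it a "first impression" test: the judge commits to what the pixels look like before intent can bias it.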
Round 5: Fix the infrastructure. Instead of trying to improve the generation prompt further (v2 and v3 both made things worse, more on that below), I improved everything around it. Expanded to 30 prompts with richer descriptions and difficulty tiers. Moved spatial coverage from code-based to LLM-judged. Matched grid sizes to complexity (32x32 for simple items, 64x64 for complex characters). Spatial coverage went from the worst-calibrated dimension (0.95 MAD) to the best (0.61 MAD, 96% agreement with human scores).
# The sprites
Here's a selection of what the harness produced. The scores are human ratings, not the LLM judge's.
The best, simple subjects with clear shapes (4.8-5.0/5):
Terrain tiles, consistently the strongest category (4.7/5 average):
The middle ground, recognisable but rough (3.0-4.0/5):
The worst, vehicles and complex multi-part subjects (2.3-3.0/5):
The archive sprites, from early coding sessions before the eval harness existed:
There's something interesting about the archive sprites. They were generated during early Claude Code sessions where I was building the tank game prototype. Simpler prompts, no rubrics, just "make a big yellow tank with chunky tracks." They look noticeably better than some of the eval harness output that used much more detailed prompts. The simplest prompting approach often produced the best results.
# What worked and what didn't
What worked:
- Blind human grading. I scored sprites without seeing what the LLM judge had given them, so I wouldn't anchor to its scores.
- Model separation. Opus judging Sonnet's output, not its own. This reduced the self-evaluation bias.
- Vision. Letting the judge see the rendered pixels instead of just reading drawing commands dropped prompt adherence MAD by 0.16 in one round. Biggest single win.
- Per-dimension judging. Separate LLM calls for each dimension meant I could identify exactly which scoring aspect was broken and fix it independently.
- Infrastructure over instructions. Better test prompts, appropriate grid sizes, and multi-shot generation helped more than trying to write a perfect generation prompt.
What didn't work:
- More prescriptive generation prompts. I tried two variants. V2 added perspective rules and size guidance, and overall quality dropped by 0.16. V3 emphasised detail and added category-specific instructions, and quality dropped by 0.45. More constraints made the model worse, not better. The baseline prompt that just said "break into 3-5 parts, shade with 3+ layers, use 40-80 commands" outperformed both.
- Code-based detail density. Counting drawing commands doesn't equal quality. Fifty rectangles could be rich texture or fifty overlapping blobs.
- Prompt adherence calibration. Despite everything (vision, pre-checks, explicit rubric language), the judge still over-credits intent over result. A sprite whose commands describe a knight but whose pixels look like a blob gets LLM 4, human 2. This gap never fully closed.
The stubborn problem: intent vs reality. The LLM judge reads intent while humans judge the visual result. The drawing commands describe a tank with tracks flanking a hull and a turret on top. The judge sees that description and gives credit. The human sees the rendered pixels and thinks "that's a green blob." Adding vision helped a lot, but richer, harder prompts in later rounds kept exposing the gap. The judge is pattern-matching against intent at a level humans don't, and I never found a way to fully close that.
# The useful bits: PICT tools
The sprite quality may have a ceiling, but the tooling that came out of the project is the bit I'm most pleased with. I ended up with three C tools for getting pixel art onto classic Macs:
grid2pict takes a JSON sprite (palette + pixel grid) and writes a valid PICT 2.0 file with PackBits compression. System 7 and up can open these natively.
pict2macbin wraps PICT files in MacBinary II format with a proper resource fork. Multiple PICTs can go into a single resource file with custom IDs and names.
picts2dsk creates an HFS disk image (.dsk file) containing PICT files, using a bundled libhfs library. Mount the disk image in an emulator or write it to a floppy and the files are right there.
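The PackBits compression grid2pict uses is simple run-length encoding: a header byte of 0-127 means "copy the next n+1 literal bytes", and 129-255 means "repeat the next byte 257-n times". The real tool is C; a compact Python illustration of the scheme:

```python
# Illustration of PackBits run-length encoding, the scheme PICT rows use.
# Runs become two-byte (header, value) pairs; literals get a count prefix.

def packbits(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure the run of identical bytes starting at i (max 128).
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 128:
            run += 1
        if run >= 2:
            out += bytes([257 - run, data[i]])  # repeat header: 257 - n
            i += run
        else:
            # Collect literal bytes until the next run of 2+ (max 128).
            start = i
            i += 1
            while (i < len(data) and i - start < 128
                   and not (i + 1 < len(data) and data[i + 1] == data[i])):
                i += 1
            out += bytes([i - start - 1]) + data[start:i]
    return bytes(out)

encoded = packbits(b"\x05\x05\x05\x05\x01\x02")  # → b"\xfd\x05\x01\x01\x02"
```

Uniform sprite regions (sky, background, flat shading) compress extremely well under this scheme, which is why it suits indexed-colour pixel art.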
The pipeline works. Generate a sprite, convert to PICT, wrap in MacBinary or write to a disk image, and open it on a real Mac (or an emulated one). Here's a sprite opened in SimpleText on a classic Mac running in QemuMac, my QEMU-based classic Mac emulation setup:
# Takeaways
If you're doing anything iterative with LLM output, an eval harness is worth building. The five rounds of calibration sound like a lot of work, but each round was a focused session with Claude Code. The harness paid for itself almost immediately. Without it I'd have been making changes and guessing whether they helped. With it, I could see that v2 was -0.16 and v3 was -0.45 within an hour of running the eval. That saves days of wasted iteration.
My advice if you're building one: start with blind human grading of a small set (10-20 examples), use a different model for judging than generating, give the judge vision if you're working with anything visual, and score dimensions independently so you can debug them one at a time.
As for generating pixel art this way, the model is good at filling grids with texture (terrain tiles score 4.7/5) and producing simple iconic shapes (potions, keys, crowns). It struggles with complex silhouettes, spatial relationships between components, and anything that needs precise anatomy at low resolution. It doesn't have a spatial canvas in its head. It's producing a sequence of shape descriptions hoping they compose into something recognisable. Sometimes they do, often they don't.
If I were starting again, I'd skip the LLM-to-drawing-commands approach for complex subjects. A better route would be AI image generation to produce high-resolution reference art, then deterministic tooling to convert each frame down to indexed-colour pixel art using the grid format. Image generation models are good at "what should it look like" and conversion tooling can be precise about "how to represent it as pixels." The PICT conversion pipeline from AssetForge would slot right into that.
The C tools for creating PICT resources are the bit I'll keep using. Whatever the source of the pixel art (hand-drawn, AI-generated, converted from high-res images) the pipeline for getting it onto a classic Mac in the right format works well and I'll use it for future retro game projects.