I'm building PeerTalk, a networking SDK that lets modern and retro Macs talk to each other. Testing network performance on real Classic Mac hardware was laborious: FTP the binary across with FileZilla, uncompress it, run it manually, make sure logs are in the right place, FTP the logs back, then get Claude to analyse them. Every code change meant repeating the whole thing. I automated it with an MCP server and skills.
# Demo
Here's the Performa 6200 running a throughput test. The video shows the Ubuntu terminal running Claude Code on the left, and a live feed from the real Mac (via video capture device) on the right. The test tries different buffer sizes (256, 512, 1024, 2048, 4096 bytes) and measures KB/s send and receive rates. The current sweet spot is 2048 bytes at 94 KB/s. At 4096 bytes, performance tanks to 9 KB/s - that's the fragmentation threshold kicking in when the peer buffer hits 75% pressure.
The MCP server, skills, and tooling tie this all together: build the test app, deploy it to the Mac via the Retro68 Application Launcher, execute remotely, collect logs, analyse results, and repeat. What used to take 20+ minutes of manual work now happens in one command.
Performa 6200 running the throughput test - showing the 2048B sweet spot and the 4096B cliff
The bit at the end shows me using Claude to look at the logs and the SDK code together. Claude suggests where the bottleneck might be and what could be changed in the SDK, but it needs review.
# My Test Apps
Testing the SDK on real hardware means running test applications on the Classic Mac that talk to a test partner application on the Ubuntu machine over the network. I have five test apps, each testing different aspects of the SDK:
test_throughput - Measures sustained data transfer rate. The Mac streams data to the Ubuntu machine, which echoes it back. Tries different buffer sizes (256, 512, 1024, 2048, 4096 bytes) to find the sweet spot. This is the one shown in the video.
test_latency - Measures round-trip time. The Mac sends timestamped messages, the Ubuntu partner echoes them back unchanged, and the Mac calculates RTT. Needed to verify the SDK isn't adding unnecessary latency.
test_discovery - Tests UDP discovery reliability. Counts discovery packets sent vs received, measures time to first discovery, tracks packet loss. The SDK needs to find peers reliably before it can talk to them.
test_stream - Measures pure one-way throughput without echo overhead. The Mac streams to Ubuntu (Ubuntu sinks the data), then Ubuntu streams to the Mac (Mac sinks). Shows true unidirectional capacity without round-trip latency.
test_stress - Rapid connect/disconnect cycles to catch memory leaks and resource issues. Runs multiple simultaneous connections and tracks memory usage via MaxBlock()/FreeMem(). Critical for 68k Macs with limited RAM.
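The echo-based tests all share the same core loop: send a chunk, wait for the echo, count bytes against the clock. Here's a minimal Python sketch of the idea - the real test apps are C code built with Retro68, and the in-process echo partner here is just a stand-in so the sketch runs on its own:

```python
import socket
import threading
import time

def run_echo_partner(port):
    """Tiny stand-in for perf_partner's echo mode: read bytes, send them back."""
    stop = threading.Event()

    def serve():
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(("127.0.0.1", port))
            srv.listen(1)
            srv.settimeout(0.5)
            while not stop.is_set():
                try:
                    conn, _ = srv.accept()
                except socket.timeout:
                    continue
                with conn:
                    while data := conn.recv(65536):
                        conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return stop

def measure_throughput(host, port, chunk_size, duration=0.5):
    """Stream fixed-size chunks, wait for each echo, return KB/s."""
    payload = b"x" * chunk_size
    sent = 0
    with socket.create_connection((host, port)) as sock:
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            sock.sendall(payload)
            remaining = chunk_size           # echo mode: read the whole chunk back
            while remaining:
                data = sock.recv(remaining)
                if not data:
                    raise ConnectionError("partner closed early")
                remaining -= len(data)
            sent += chunk_size
    return sent / duration / 1024

stop = run_echo_partner(7354)                # 7354 = the partner's TCP port
time.sleep(0.2)                              # let the listener come up
for size in (256, 512, 1024, 2048, 4096):    # same sweep as test_throughput
    kbps = measure_throughput("127.0.0.1", 7354, size)
    print(f"{size:5d} B -> {kbps:8.1f} KB/s")
stop.set()
```

On loopback every size is fast; the interesting numbers only appear when the far end is a real 68k Mac.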
Here's a rough diagram of what's happening with the test partner and test app on the Mac:
```
Ubuntu Machine           Network           Performa 6200
(Test Partner)  <---------------------->   (Test App)
 echoes back                              sends messages
```
You have to compile the right test app for the right machine and move everything into place manually. The full workflow looked like this:
- Build the test app in Docker (Retro68 cross-compiler)
- Open FileZilla, connect to the Mac via FTP
- Upload the `.bin` and `.dsk` files
- Switch to the Mac, uncompress the `.bin` or mount the `.dsk`, and put the application in place
- Manually start the test partner on Ubuntu
- Quickly switch back to the Mac and launch the test app (timing matters!)
- Wait for both to finish, hope you got the order right
- Back to FileZilla, download the logs
- Move logs into the project, ask Claude to analyse them
- Make changes based on findings
- Repeat
For a single test run, that's 20+ minutes of manual steps. When you're iterating on performance improvements, you're running tests constantly. It adds up fast.
# Automating It
I built an MCP server and Claude Code skills to automate the whole thing.
MCP (Model Context Protocol) is an open standard for giving AI tools access to external systems. In Claude Code, you can add MCP servers that give Claude tools it wouldn't normally have - things like querying a database, calling an API, or in my case, deploying binaries to real Classic Mac hardware. I built the classic-mac-hardware MCP server to give Claude FTP and LaunchAPPL access to my Macs.
Skills are markdown files with step-by-step instructions that Claude follows when you type a slash command. They turn multi-step workflows into single commands. Now the workflow is:
`/run-test throughput performa6200`

Here's what happens:
- Builds the test app (picks the right build for the machine's RAM)
- Starts the POSIX test partner app in Docker
- Checks the machine registry to see how to deploy and launch the app
- Deploys and executes the binary remotely via LaunchAPPL (Retro68's remote application launcher, TCP port 1984)
- Waits for completion, watching for the test exit message
- Collects the logs (test apps stream them to the partner at test completion)
- Analyses the results and suggests improvements
The MCP server checks the machine registry to see what's available. If the machine has LaunchAPPL configured (Retro68's application launcher), it uses that to deploy and launch the binary over TCP. The test apps stream their logs to the test partner when they finish. FTP is available as a backup option for fetching logs or other files if the Mac has an FTP server registered.
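That capability check can be sketched like this - a hypothetical `pick_deploy_method` helper over registry entries shaped like machines.json:

```python
# Illustrative registry entries, same shape as machines.json
MACHINES = {
    "performa6200": {
        "launchappl": {"host": "10.188.1.213", "port": 1984},
        "ftp": {"host": "10.188.1.213", "port": 21},
    },
    "macse": {
        "launchappl": {"host": "10.188.1.55", "port": 1984},
    },
}

def pick_deploy_method(machine):
    """Prefer LaunchAPPL (deploy + execute); fall back to FTP (files only)."""
    entry = MACHINES.get(machine)
    if entry is None:
        raise KeyError(f"{machine} not in registry - run /setup-machine first")
    if "launchappl" in entry:
        return "launchappl"
    if "ftp" in entry:
        return "ftp"
    raise RuntimeError(f"{machine} has no usable transport registered")

print(pick_deploy_method("macse"))   # launchappl
```

The point is that the skill never hard-codes a transport: the registry entry decides.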
# How It Works
A Claude Code skill is a markdown file that lives in your project's .claude/skills/ directory. When I type /run-test throughput performa6200, Claude loads the SKILL.md file into its context and follows the instructions inside it, using its tools (bash commands, MCP server calls, file reading) to execute each step. It's a runbook. The frontmatter registers it as a slash command:
```yaml
---
name: run-test
description: Run hardware tests on Classic Mac with full workflow automation.
argument-hint: <test> [machine] [--skip-build] [--skip-analysis] [--verbose]
---
```
The rest of the file is step-by-step instructions that Claude reads and follows. Here's what each step does and how Claude knows what to do.
Step 1: Build Test Apps
The skill instructions tell Claude to check the machine registry for the target machine's RAM and pick the right build script. The Performa 6200 has 40MB, so Claude uses the standard build (2-3MB heap). If you target a Mac SE with 4MB, the skill instructions say to switch to lowmem builds (384-512KB heap). Claude runs the right command:
```sh
# Standard (Performa 6200)
./scripts/build-mac-tests.sh mactcp perf

# Lowmem (Mac SE)
./scripts/build-mac-tests.sh mactcp lowmem
```

Under the hood, `build-mac-tests.sh` uses Docker Compose to spin up a container with the Retro68 cross-compilation toolchain. Inside the container, it runs `Makefile.retro68` which compiles with `m68k-apple-macos-gcc` for 68k builds and `powerpc-apple-macos-gcc` for PPC. The output is `.bin` files in MacBinary format.
There's also build-launcher.sh, which builds LaunchAPPLServer itself for Classic Macs (both 68k and PPC versions). That's the app running on the Mac that receives binaries over the network - more on that in Step 3.
The skill instructions also tell Claude to build the test partner that runs on the Ubuntu machine.
Step 2: Check/Start Test Partner
The skill gives Claude the docker command to start the test partner. Claude checks if it's already running first. If not, it runs:
```sh
docker run -d --name perf-partner --network host \
  -u "$(id -u):$(id -g)" -v "$(pwd)":/workspace -w /workspace \
  -e MACHINE_REGISTRY="10.188.1.55:macse,10.188.1.213:performa6200" \
  peertalk-posix:latest ./build/bin/perf_partner --verbose
```

The partner runs in echo mode by default and auto-detects all test types. It listens on ports 7353 (discovery) and 7354 (TCP). For throughput tests, it echoes messages back. For stream tests, it auto-detects control messages and switches between sinking and streaming data as needed.
Step 3: Deploy and Execute via MCP
The skill tells Claude to call execute_binary on the classic-mac-hardware MCP server. The MCP server is a Python server that runs inside Docker and gives Claude standardised access to real Classic Macs. Claude calls it like any other tool:
```
mcp__classic-mac-hardware__execute_binary(
    machine="performa6200",
    platform="mactcp",
    binary_path="build/mac/test_throughput.bin"
)
```

Inside `server.py`, the `execute_binary` handler looks up the machine in the registry, finds the LaunchAPPL host and port, and runs LaunchAPPL via subprocess with `-e tcp --tcp-address <machine_ip>`. LaunchAPPL connects to LaunchAPPLServer on the Mac over port 1984, transfers the binary, and launches it. There's a 60-second timeout so the skill can detect if something went wrong.
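A sketch of what that handler boils down to - the control flow matches the description above, but the function body and return strings here are illustrative, not the actual server.py:

```python
import subprocess

def execute_binary(machine, binary_path, registry, timeout=60):
    """Resolve the machine's LaunchAPPL endpoint and push the binary to it."""
    entry = registry[machine]
    launch = entry.get("launchappl")
    if launch is None:
        raise RuntimeError(f"{machine} has no LaunchAPPL endpoint registered")
    cmd = [
        "LaunchAPPL",
        "-e", "tcp",
        "--tcp-address", launch["host"],   # LaunchAPPLServer listens on 1984
        binary_path,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"timeout: no completion within {timeout}s"
    return "ok" if result.returncode == 0 else f"failed: {result.stderr.strip()}"
```

Returning the timeout as a result string rather than raising lets the skill see what happened and decide whether to retry.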
The MCP server exposes 10 tools in total - file operations (upload_file, download_file, list_directory, create_directory, delete_files) via FTP, remote execution (execute_binary) via LaunchAPPL, and machine management (list_machines, test_connection, cleanup_machine, reload_config). The /run-test skill primarily uses execute_binary, but other skills like /deploy and /setup-machine use the FTP tools.
The server knows about each machine through a machines.json registry. The /setup-machine skill handles adding new machines to this file. Each entry describes the machine's capabilities - what platform it runs, how much RAM it has, and how to connect:
```json
{
  "performa6400": {
    "name": "Performa 6400",
    "platform": "opentransport",
    "system": "System 7.6.1",
    "cpu": "PPC 603e",
    "ram": "48MB",
    "ftp": {
      "host": "10.188.1.102",
      "port": 21,
      "username": "mac",
      "password": "mac"
    },
    "notes": "Performa 6400/180 - PPC 603e - 48MB RAM"
  },
  "performa6200": {
    "name": "Performa 6200",
    "platform": "mactcp",
    "system": "System 7.5.3",
    "cpu": "68030",
    "ram": "40MB",
    "build": "standard",
    "ftp": {
      "host": "10.188.1.213",
      "port": 21,
      "username": "mac",
      "password": "mac"
    },
    "notes": "Performa 6200/75 - 68030 - 40MB RAM",
    "launchappl": {
      "host": "10.188.1.213",
      "port": 1984
    }
  },
  "macse": {
    "name": "Mac SE",
    "platform": "mactcp",
    "system": "System 6.0.8",
    "cpu": "68000",
    "ram": "4MB",
    "build": "lowmem",
    "launchappl": {
      "host": "10.188.1.55",
      "port": 1984
    },
    "notes": "Mac SE - 68000 - 4MB RAM - LaunchAPPL only"
  }
}
```

The Mac SE only has LaunchAPPL - no FTP server. The Performa 6200 has both. When Claude calls a tool, `server.py` checks the registry to see what's available and uses the right connection method. For FTP operations, it uses Python's `ftplib` with rate limiting for RumpusFTP stability (0.5s between operations) and translates Mac colon paths (`:`) to Unix format (`/`).
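The path translation and rate limiting could look something like this sketch. In HFS paths a leading colon marks a relative path and the first segment of an unprefixed path is the volume name; the exact rules in server.py may differ:

```python
import time

def mac_to_unix_path(mac_path):
    """Translate an HFS colon path to the slash form ftplib expects.

    ':Logs:throughput.log' -> 'Logs/throughput.log'  (relative)
    'HD:Logs:throughput.log' -> '/HD/Logs/throughput.log'  (volume-rooted)
    """
    if mac_path.startswith(":"):
        return mac_path[1:].replace(":", "/")
    return "/" + mac_path.replace(":", "/")

class RateLimiter:
    """Space operations out so RumpusFTP isn't hammered (0.5 s default)."""
    def __init__(self, interval=0.5):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.monotonic()

print(mac_to_unix_path(":Logs:throughput.log"))   # Logs/throughput.log
```

Calling `wait()` before every `ftplib` operation is a blunt but effective way to stop a burst of directory listings from wedging an old FTP server.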
The whole MCP server runs inside Docker via run-in-container.sh, so there are zero host dependencies.
Step 4: Monitor Completion
The skill tells Claude to poll rather than wait with a single long sleep. After an initial wait (90 seconds for most tests), Claude checks every 15 seconds for new log files appearing in the results directory. The skill includes a timing table so Claude knows how long each test type should take and when to give up.
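The polling loop Claude follows can be sketched as below - the default timings mirror the skill's instructions, but the helper name and directory handling are mine:

```python
import time
from pathlib import Path

def wait_for_logs(results_dir, initial_wait=90.0, poll_interval=15.0,
                  give_up_after=600.0):
    """Wait out the test's expected runtime, then poll for new log files."""
    results = Path(results_dir)
    known = set(results.glob("*.log")) if results.exists() else set()
    time.sleep(initial_wait)                 # most tests need at least this long
    deadline = time.monotonic() + give_up_after
    while time.monotonic() < deadline:
        current = set(results.glob("*.log")) if results.exists() else set()
        new = current - known
        if new:
            return sorted(new)               # the freshly streamed logs
        time.sleep(poll_interval)            # check every 15 s by default
    raise TimeoutError(f"no new logs in {results_dir} - test may have hung")
```

Snapshotting the directory before the wait means stale logs from earlier runs can't be mistaken for fresh results.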
Step 5: Collect and Analyse
Each test produces two logs. The Mac test app streams its results to the partner when it finishes, which auto-saves them to plan/performance/mactcp/performa6200/throughput_YYYYMMDD_HHMMSS.log. Claude also grabs the partner's own log with docker logs perf-partner and saves it alongside. No FTP download needed - the logs are already on the Ubuntu machine.
The skill then tells Claude to produce a summary table and JSON metrics block. Claude reads both logs, extracts the metrics, and cross-references them - the Mac's sent count should match the partner's echo count. The analysis covers peak throughput, optimal chunk size, where performance drops off, comparison to previous runs, and suggested SDK improvements.
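The cross-referencing step amounts to parsing counters from both logs and checking they agree. A sketch, with an invented log format (the real logs are whatever the test apps and partner emit):

```python
import re

def parse_counts(log_text):
    """Pull key=value counters out of a test log (format invented here)."""
    return {key: int(value)
            for key, value in re.findall(r"(\w+)=(\d+)", log_text)}

mac_log = "throughput done: sent=4500 recv=4500 errors=0"
partner_log = "throughput done: echoed=4500"

mac, partner = parse_counts(mac_log), parse_counts(partner_log)
# The Mac's sent count should match the partner's echo count
assert mac["sent"] == partner["echoed"], "partner missed messages"
print("counts agree:", mac["sent"])
```

A mismatch here usually means dropped messages somewhere between the Mac and the partner, which is itself a finding worth analysing.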
Here's a full session running both the latency and throughput tests on the Performa 6200, then asking Claude to look at the results:
Running latency and throughput tests back-to-back, then analysing results with Claude. The Mac's clock shows 1am but it wasn't - the PRAM battery needs replacing.
# The Skills
Skills orchestrate the MCP tools into repeatable workflows. The /run-test workflow is covered above. Here are the other skills used for hardware testing.
Setting up a Mac
Before running tests, a Mac needs to be registered with the MCP server and configured for remote access. Three skills handle this:
/setup-machine - Registers a new Classic Mac in the machines.json registry. Claude asks for the machine's IP address, platform, and RAM, then verifies FTP connectivity and creates the directory structure on the Mac.
/setup-launcher - Builds and deploys LaunchAPPLServer to a registered Mac via FTP. Cross-compiles using the Retro68 toolchain in Docker for the right platform (68k or PPC).
/test-machine - Tests FTP and LaunchAPPL connectivity to a registered Mac. Useful for checking everything is working before running tests.
Once registered, the Mac needs MacTCP (or Open Transport) configured with a static IP and LaunchAPPLServer running on port 1984. I also set RumpusFTP to start on boot for FTP access on port 21. This video shows the Performa 6200 getting configured:
Setting up MacTCP and LaunchAPPLServer on the Performa 6200
Building, deploying, and testing
/build - Cross-platform build orchestration. Builds POSIX, 68k MacTCP, and PPC Open Transport binaries, runs tests, coverage, and static analysis.
/deploy - Deploys compiled binaries to Classic Macs via the MCP server's FTP tools. Supports deploying to a single machine, all machines on a platform, or all machines.
/execute - Runs an application on a Classic Mac via LaunchAPPL. A lighter alternative to /run-test when you just want to run something without the full test workflow.
/test-partner - Manages the POSIX test partner containers. Start, stop, check status, or view logs.
/perf-optimize - Autonomous optimisation cycles. Runs tests, identifies bottlenecks, implements fixes, and verifies improvements. Covered in more detail below.
The project has 16 skills in total, including skills for ISR safety checking, Classic Mac API lookups, phase plan implementation, and session management. The full list is in CLAUDE-CODE-SETUP.md.
# Autonomous Optimisation
The /run-test skill is just the start. I've also built /perf-optimize, which runs autonomous optimisation cycles:
- Run tests to establish a baseline
- Analyse results to identify the bottleneck
- Implement a fix in the SDK code
- Verify improvement by running tests again
- Iterate through multiple cycles
This is highly experimental. Without human guidance, who knows what Claude will do. It might make sensible changes, or it might optimise itself into a corner. Use with caution and review every change it makes.
It's designed to iteratively improve the SDK without manual intervention. Point it at a machine, tell it how many cycles to run, and it figures out the rest.
For example, if it sees throughput drop 90% at 4096 bytes (like in the video), it might:
- Identify the issue: fragmentation threshold too aggressive at 75%
- Implement the fix: raise it to 85% in `src/core/queue.h`
- Run tests: verify the change gives +400% at 4096 bytes
- Check for regressions: make sure other sizes didn't suffer
- Move to the next bottleneck
The full details are in the perf-optimize skill.
# What I've Learnt
Naming is important. Each method should do one thing well. Early on, I had generic methods and Claude kept getting confused between similar names. The MCP server became unreliable. I rewrote it with clear, distinct names: `upload_file`, `download_file`, `list_directory`, `create_directory`, `delete_files`, `execute_binary`. Each method does exactly one thing. When Claude is choosing between tools, clear names and single responsibility make it work reliably.
Skills take a lot of iteration to get right. I made many skills in one go and then spent a lot of time debugging them. Claude would invoke the wrong skill, or insist on fetching logs via FTP when the test app had already streamed them to the test partner. Each time I'd paste the output back and explain what went wrong, tweak the skill instructions, and try again. In hindsight, I'd start with one focused skill and one focused MCP method, get that working reliably, then build out from there. The big bang approach meant debugging multiple skills and MCP interactions at once, which made it harder to pin down what was going wrong.
But once they work, they save a lot of time. The manual workflow - building binaries, deploying via FTP, running tests, gathering logs, reviewing results - took 20-30 minutes per run. The automated version takes 2-3 minutes and I don't touch FileZilla.
# Next Steps
The testing automation is working well. Next up:
- Improve the `execute_binary` timeout handling. The 60-second timeout was originally there to catch failed launches, but some tests legitimately run longer than that. Need to either make it per-test configurable or have the skill skip the timeout for known long-running tests.
- Finish the final bits on the MacTCP SDK implementation
- Add Open Transport support for PPC Macs
- Build test apps that work with the Open Transport stack
- Test between different peer types (68k MacTCP ↔ PPC Open Transport) to optimise for performance
- Run `/perf-optimize` cycles to tune cross-platform performance
The full setup is on GitHub: PeerTalk - Claude Code Setup
# Bonus: Mac SE
Here's the Mac SE running the throughput test. Same workflow, different machine - just /run-test throughput macse. The Mac SE has a 68000 CPU and 4MB RAM, so it uses the lowmem build. Played at 2.5x speed.
Mac SE running the throughput test at 2.5x speed