I've been building PeerTalk, a networking SDK that lets modern Linux boxes talk to Classic Macs over a LAN. The project has three platform backends (POSIX, MacTCP, Open Transport), runs on hardware from 1987 to today, and needs to work within 4 MB of RAM on a Mac SE. I wrote about the Claude Code customisations and the automated hardware testing in earlier posts. This one is about what happened to the planning, how GitHub's Spec Kit replaced it, and how I ended up kicking off a build at midnight and waking up to a working SDK.
# What I built
The original PeerTalk repo accumulated quite a lot of Claude Code infrastructure. By the end I had:
- 14 custom skills (covered here), including `/session` (navigate phase plans), `/implement` (orchestrate implementation sessions), `/run-test` (deploy and benchmark on real hardware), and more
- 4 hooks that blocked unsafe edits at interrupt time, ran compile checks after every save, warned about AppleTalk gotchas, and tracked test coverage
- An MCP server for accessing Classic Mac hardware over the network
- 4 domain-specific rule files with verified citations from Inside Macintosh and Apple's networking documentation (committed to the repo as full-text files)
- 12 phase plans totalling 1.2 MB of specifications
It worked: I got eight of twelve phases done, and all four platform implementations compiled and ran. I had real performance numbers from a Performa 6200, a Performa 6400, and a Mac SE. MacTCP was pulling 497 KB/s on stream sends. The ISR safety hooks caught dozens of interrupt-time violations before they hit hardware. The /mac-api skill could search the actual Inside Macintosh books and return line-level citations.
Along the way, the SDK accumulated features I never set out to build. A 964-line priority queue with O(1) free-lists and a coalesce hash table. A two-tier message buffer that routed large messages to a separate 4 KB direct buffer. TLV capability negotiation where peers exchanged max message sizes, preferred chunk sizes, buffer pressure, and six different capability flags after connecting. Batch send that packed multiple messages into single TCP packets. A stream transfer system for data up to 64 KB. Multi-transport peer deduplication (690 lines) for merging peers discovered on both TCP/IP and AppleTalk. An entire buffer pool pre-allocation system with three separate allocation strategies. A compact 4-byte header format negotiated via capability exchange to save 6 bytes per message. All fun to build so quickly, a nightmare to make work in harmony, and none of it needed for a Bomberman clone or a chess game.
# How I was working
The planning approach I'd developed across several projects went like this: write a PROJECT_GOALS.md describing the overall vision, then break the work into numbered phase plans (PHASE-1-FOUNDATION.md, PHASE-2-PROTOCOL.md, and so on). Each phase had ordered sessions, each session had specific tasks and verification criteria. Between me and Claude, we tracked dependencies between phases manually with mixed success. We often completed huge phases in one go and it was often up to me to decide when to commit and push.
On top of that I built custom Claude Code skills to automate the workflow. /session would parse the phase files, show which sessions were done, and find the next piece of work. /implement would spin up four parallel subagents to gather context (reading the phase plan, loading rules, checking code inventory, finding dependencies), then walk through the session tasks and run verification. /implementable, which I first wrote for an earlier project, evaluated whether a plan was suitable for Claude Code: did sessions fit context windows, were steps small enough, were there testable outputs.
I loved setting /implement off and coming back to a huge chunk of work done. But the chunks were too big, and the test apps were driving what got built. I'd run a throughput test on the Performa, see that synchronous sends topped out at 87 KB/s, and then the next session would add an async send pipeline with multi-slot pipelining to get it higher. The POSIX perf partner grew to 1,720 lines. The stream test app hit 1,123 lines. They had their own UI subsystems, log streaming infrastructure, and custom control protocols. Five Mac test apps and two POSIX ones, totalling 6,720 lines. They were supposed to be simple test harnesses.
The domain-specific .claude/rules/ files loaded automatically when Claude edited platform code. Edit a file in src/mactcp/ and the MacTCP rules would load: ASR callback safety, error codes, TCPPassiveOpen gotchas, buffer management patterns. Same for Open Transport, AppleTalk, and ISR safety. These came from reading the Apple reference books and encoding the rules as hooks that blocked unsafe edits before they happened.
Claude would read a big phase file, figure out what to do, and go. The intelligence was in the skills and rules. There wasn't much tooling between them, just Claude interpreting markdown... and as we know, that interpretation is not the same every time.
# What went wrong
Two things. First, the plan files:
| Phase | Purpose | Size |
|---|---|---|
| PHASE-0 | PT_Log logging library | 113 KB |
| PHASE-1 | Foundation types | 161 KB |
| PHASE-2 | Wire protocol | 100 KB |
| PHASE-3 | Queue management | 70 KB |
| PHASE-4 | POSIX implementation | 152 KB |
| PHASE-5 | MacTCP implementation | 142 KB |
| PHASE-6 | Open Transport | 180 KB |
| PHASE-7 | AppleTalk | 114 KB |
Each file was 500+ lines. /session had an explicit warning not to read them directly because of token limits. It used grep to extract session boundaries and status markers. The skill that was supposed to help me navigate my plans couldn't actually read them in one go.
They got so big partly because I kept going back and improving them during implementation. Every time I hit a gotcha or learnt something about the MacTCP API, I'd update the relevant phase plan. I was adding notes and architecture reviews as separate documents too. The idea was that the plans should be good enough for someone else to pick up, or for a clean re-implementation. I wanted people to be able to clone the repo, look at the skills and phase plans, and have a working example of how to use Claude Code on a real project and just kick it off and watch it go. Probably a bit ambitious given the project requires actual retro hardware and writes C for 30-year-old platforms, but that was the thinking. It meant I was polishing planning documents instead of writing the code they were planning for. The plans became something I maintained alongside the codebase instead of a reference I could just work from.
There were also 110 files in the plan/ directory: phase plans, architecture reviews, performance logs, hardware test results, workflow guides. The repo had 50 test source files, a metrics dashboard with its own deployment workflow, Docker infrastructure, and 75 files under .claude/ for skills, hooks, rules, agents, and MCP server configuration. The skills had grown to match: /implement had five reference files of its own, /review ran six analysis passes. 397 files total. The source code alone was 62,778 lines of C and headers across six platform directories.
Second, scope creep. The test apps were the problem. I'd run a throughput test, see a number I wanted to improve, and add features to the SDK to get it higher. Each feature felt justified because the test results proved it helped. But I was optimising for benchmarks rather than building towards actual applications.
The flow control in peer.c alone had four layers: pressure-based throttling (when a peer reports high buffer pressure, skip low-priority messages), token bucket rate limiting (50 KB/s at high pressure, 100 KB/s at medium), send window management calculated from the peer's receive buffer size, and queue backpressure at 25%/50%/75%/95% thresholds with per-priority drop policies. On top of that, adaptive RTT tuning would automatically adjust chunk sizes, pipeline depth, and rate limits based on rolling round-trip time samples. RTT under 50ms? Chunk size 4096, pipeline depth 4. Over 200ms? Chunk 512, pipeline 1. Sounds clever but is completely unnecessary for sending chess moves. I shouldn't be trying to build RabbitMQ for MacOS 7.
The PeerTalk_SendEx function shows where it all came together: validate parameters, look up peer, check pressure throttling, check token bucket, check send window, check if fragmentation is needed, check if pressure-triggered fragmentation should kick in, then route to either the priority queue with coalescing or the direct buffer. That one function incorporated every scope-creep feature in the repo.
Detailed phase plans with clear session breakdowns made it easy to keep adding features, because each addition felt like a tidy, well-organised piece of work. The architecture review identified nine gaps and proposed five improvement phases. Fine. But the right response was "do any of these actually matter for the three apps I'm building?" rather than planning fixes for all nine.
# Rules without structure
The original project had a PROJECT_GOALS.md, a CLAUDE.md, and four domain-specific rule files in .claude/rules/. On paper, that looks like it should have kept things on track. In practice, enforcement was patchy at best and the structure was too loose to hold up across sessions.
PROJECT_GOALS.md was 528 lines and 20 KB. It had ASCII art architecture diagrams, six code examples showing every API usage pattern, tables for message priorities, coalescing options, transport options, per-peer statistics, resource-aware RAM tables, and library deliverables for every platform variant. It listed everything and then some: the supported application types alone covered turn-based games, real-time games, chat, file transfer, collaborative tools, and custom protocols.
Worse, the goals document had already designed the API. PeerTalk_SendEx with PT_PRIORITY_REALTIME and PT_COALESCE_NEWEST flags, spec'd out in code examples. Message priorities (realtime, high, normal, low) and coalescing modes (none, newest, oldest), fully documented with enums and usage patterns. The four-layer flow control was baked into the goals from the start. I'd written an API design document, called it "project goals," and the implementation just built what the goals described. Nothing in it said what PeerTalk shouldn't do.
CLAUDE.md ran to 400+ lines of protocol constants, build commands, code quality gates, common pitfalls, hardware configuration, skill documentation. Useful for running Claude Code sessions, but it had nothing to say about scope. The .claude/rules/ files were good domain knowledge: "Don't call malloc in an ASR callback." "Clear T_UDERR or the endpoint hangs." "TCPPassiveOpen is one-shot." They stop you writing broken MacTCP code, but they had nothing to say about whether a particular piece of MacTCP code needed to exist. /implementable checked whether plans fit context windows and had testable outputs. It never asked whether the planned work belonged in the project.
And Claude doesn't remember any of this between sessions. Each new conversation starts fresh. The goals file might get loaded, or it might not. Claude might read the first 200 lines, hit a context limit, and miss the rest. During a long session, early context fades. I'd run a hardware test, see a throughput number, mention it to Claude, and the conversation would drift into optimisation work that had nothing to do with the goals. The goals document was sitting right there in the repo but Claude wasn't checking it before saying yes to my bad ideas. Nothing in the workflow forced that check to happen.
Good information, impressive tooling. But none of it was rigidly enforced or fully repeatable. The whole setup depended on Claude happening to do the right thing, which is exactly what you should not depend on.
The original project went off the rails partly because I trusted Claude to make scope decisions. I'd see a throughput number I didn't like, mention it in a session, and Claude would design an async pipeline with multi-slot pipelining to fix it. I'd say "the test apps need better UI" and get a 700-line table UI subsystem. Each feature felt justified in the moment because Claude's implementation was good. The code compiled, the tests passed, the benchmarks improved. But I was rubber-stamping features rather than questioning whether they belonged. Claude doesn't say "you don't need this" unless something in the project tells it to. I didn't have that something.
# What is Spec Kit
Spec Kit is an open-source toolkit from GitHub for spec-driven development. You write specifications and AI agents build the implementation. It works with Claude Code, Copilot, Cursor, Gemini, and 20+ other agents. My original approach was conversational: Claude reads a big markdown file, figures out what to do, goes. Spec Kit is more structured. Each stage has its own command, produces a specific artifact, and the next stage won't run until the previous one exists.
The pipeline:
- `/speckit.constitution` writes a project constitution with non-negotiable principles (amendable via a versioned governance procedure, but authoritative while in force)
- `/speckit.specify` produces a feature specification: user stories with priority levels, acceptance scenarios, functional requirements, success criteria
- `/speckit.clarify` runs a structured ambiguity scan across ten categories and asks targeted questions
- `/speckit.plan` generates research decisions, data models, and interface contracts (API endpoints, CLI schemas, or whatever the project exposes)
- `/speckit.tasks` breaks the plan into a checklist of tasks with dependency markers
- `/speckit.analyze` does a read-only consistency check across all artifacts
- `/speckit.implement` executes the tasks phase by phase
This is where it gets interesting for me. What makes Spec Kit work where my home-rolled approach fell apart is the bash scripts underneath. Each command starts by running a script that checks what exists on disk and outputs JSON. check-prerequisites.sh looks for plan.md and tasks.md; if plan.md is missing it exits with "Run /speckit.plan first." create-new-feature.sh scans git branches and the specs/ directory to find the next feature number, creates the branch, and copies in the spec template. setup-plan.sh scaffolds the plan from a template. None of this relies on Claude reading a file and figuring out what state the project is in. The scripts check the filesystem and tell Claude what exists and where to find it.
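As an illustration of that filesystem-first approach, here is a minimal sketch of the gating idea. It is not the actual check-prerequisites.sh; the directory path follows the article's examples and the JSON shape is invented for illustration:

```shell
# Gate a pipeline stage on artifacts existing on disk, Spec Kit-style.
# Paths follow the article; the JSON shape is invented for illustration.
FEATURE_DIR="specs/001-peertalk-sdk"
mkdir -p "$FEATURE_DIR"
touch "$FEATURE_DIR/plan.md"        # pretend /speckit.plan has already run

# The gate: refuse to continue until the previous stage's artifact exists.
if [ ! -f "$FEATURE_DIR/plan.md" ]; then
    echo "Run /speckit.plan first." >&2
    exit 1
fi

# Report state as JSON so the agent reads facts instead of inferring them.
has_tasks=$([ -f "$FEATURE_DIR/tasks.md" ] && echo true || echo false)
printf '{"feature_dir":"%s","has_plan":true,"has_tasks":%s}\n' \
    "$FEATURE_DIR" "$has_tasks"
# → {"feature_dir":"specs/001-peertalk-sdk","has_plan":true,"has_tasks":false}
```

The point of the JSON isn't the format, it's that the state comes from the filesystem, not from the model's memory of what it did last session.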
With my old approach, /session had to parse 500-line phase files with grep to find session boundaries because it couldn't read the whole thing. Spec Kit doesn't have that problem because the pipeline splits what used to be one 180 KB phase plan into five or six focused artifacts: spec.md, plan.md, research.md, data-model.md, tasks.md. Each one has a single purpose and stays small enough for Claude to read in full. Even the largest artifact across all my peertalk specs (research.md for the full SDK rewrite) is under 1,700 lines.
The task format is enforced by the command instructions and a template. Every task looks like `- [ ] T001 [P] [US1] Description with file path`. T001 is a sequential ID that never gets reused. [P] means parallelisable (touches different files, no dependencies). [US1] traces back to user story 1 in the spec. The format is strict enough for a shell script to drive: `grep -c '\[x\]' tasks.md` gives you the done count, `grep -c '\[ \]'` the remaining. /speckit.implement batches the [P] tasks and runs them concurrently. /speckit.analyze checks that every requirement in the spec has at least one task covering it, and flags any task that doesn't map to a requirement.
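Because the format is that strict, progress tracking really is just grep. A quick sketch, with invented task lines:

```shell
# Three illustrative task lines in the Spec Kit checkbox format.
cat > tasks.md <<'EOF'
- [x] T001 [P] [US1] Create project directory structure
- [x] T002 [US1] Define the wire format in include/peertalk.h
- [ ] T003 [P] [US2] Implement discovery in src/pt_discovery.c
EOF

done_count=$(grep -c '\[x\]' tasks.md)      # completed tasks
todo_count=$(grep -c '\[ \]' tasks.md)      # remaining tasks
next_id=$(grep -m1 '\[ \]' tasks.md | grep -o 'T[0-9][0-9][0-9]' | head -1)

echo "done=$done_count remaining=$todo_count next=$next_id"
# → done=2 remaining=1 next=T003
```

That's the entire "project state" interface a wrapper script needs: two counts and a next-task ID.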
The gating between stages is the thing I was trying to do manually with session tracking and dependency notes. Spec Kit handles it with file existence checks: no spec means no plan, no plan means no tasks, no tasks means no implementation. The constitution gets checked twice during planning (once before research, once after design), and /speckit.analyze treats constitution violations as automatically critical. My old approach had PROJECT_GOALS.md and 12 phase plans with manually tracked dependencies. Spec Kit replaces all of that with a few bash scripts and some consistent markdown formats.
# The constitution
The constitution is a versioned document with ten principles that gate every feature, and you write it first. Before the spec, before the plan, before any code. I love the simplicity of this document - you can scan it in 30 seconds and know all you need to know about what peertalk is.
The first principle would have prevented most of the original scope creep:
> I. The Three Apps Are the Spec. Every feature MUST serve at least one target application. No adaptive throttling, no priority queues, no capability negotiation, no multi-transport peer merging. If none of the three apps need it, it does not ship.
The whole thing is 218 lines and 7 KB. One code example, ten principles, a "What Does Not Ship" list, and a three-line definition of done.
PROJECT_GOALS.md had six code examples with PeerTalk_SendEx, message priorities, and coalescing flags already designed. The constitution has one code block showing PT_Send should exist but no implementation details other than the method signature. The goals listed six supported application types without excluding anything. The constitution names three apps and drops everything else. It even has a "What Does Not Ship" section listing priority queues, capability negotiation, adaptive tuning, rate limiting, multi-transport anything, config structs with more than five fields. Every one of those was in the original API.
Principle V says pre-allocate everything. The original repo already did this, but it wasn't written down as a rule. Principle VI says adapt at init, not at runtime, which rules out the adaptive RTT tuning the original had. Principle IX caps the codebase at 15,000 lines, and the original was already at 9,000+ with four phases to go.
The constitution defines "done":
> PeerTalk is done when a test app on Linux can discover and exchange messages with a test app on a Mac SE (MacTCP) and a Performa 6400 (OT). The API fits on one screen. The code is simple enough to be fun to read.
The original project had twelve phases with their own completion criteria, but nothing that said when to stop. Current state against that definition:
- Mac SE (68k, MacTCP, System 6.0.8, 4 MB RAM): all four tests PASS
- Performa 6200 (PPC, MacTCP, System 7.5, 8 MB RAM): all four tests PASS
- Performa 6400 (PPC, Open Transport, System 7.6.1, 48 MB RAM): all four tests PASS
- Linux (POSIX, x86_64): all four tests PASS
- Public API: 20 functions, single header, fits on one screen
- Codebase: 5,922 lines across all platforms (constitution caps it at 15,000)
- Memory: zero allocation after PT_Init on all platforms
Every plan in Spec Kit has to pass a constitution check. plan.md tests all ten principles against the planned work. If a feature can't point to Bomberman, chess, or chat, it doesn't go in. Turns out what I actually needed was a document that could say "no."
# Spec Kit in practice
From the Spec Kit plan.md for PeerTalk:
> The send path is deliberately simple. `PT_Send` frames the message and calls `platform_ops->tcp_send` or `udp_send`. No rate limiting, no send queue, no batching beyond what TCP provides. This is a deliberate reaction to v1, which grew rate limiters, token buckets, capability exchange, and multi-tier backpressure - none of which was needed by the three target applications.
The planning step in /speckit.plan starts with a research phase that resolves technical unknowns before any design work. research.md has ten decisions: build system, platform detection, memory strategy, async I/O model, chunking, discovery filtering, TCP accept model, logging, OT throughput, byte order. Each with alternatives and a rationale. In the original repo, these decisions got made as I went and ended up scattered across commit messages and session notes.
Locking the API contract to 20 public functions before any implementation code exists helped too. The original PeerTalk public header was 1,499 lines: 77 public functions, 19 enum types, a config struct with 17 fields for things like flow control thresholds and buffer pool pre-allocation strategies. The new header is 165 lines: 22 functions, 4 enum types, zero-config initialisation with just a name parameter. Seven send function variants became one PT_Send. Downstream projects like csend compile peertalk from source as an in-tree dependency and link against that single header.
# The implementation loop
The new PeerTalk repo has its constitution, specification, research decisions, data model, API contract, and task breakdown. Thirty-two tasks across eight phases. The original had hundreds across twelve.
autorun.sh isn't part of Spec Kit. I built it because the foundation felt solid enough to trust unattended. After running /speckit.implement manually a few times and watching it work through tasks cleanly — checking them off, verifying builds, moving to the next phase — I thought: the task format is just checkboxes, the progress is just grep, why am I sitting here? So I wrote a bash script that wraps /speckit.implement in a loop. It counts completed tasks by grepping for [x] markers, detects the current phase, builds the prompt with progress state, and runs Claude with --dangerously-skip-permissions. Exponential backoff for rate limits, stuck detection if the same task fails three times, build verification after each iteration.
This flag gives Claude unrestricted access to your machine: file writes, shell commands, network access, everything. I only use it on a dedicated build machine with nothing else on it, running inside a project directory I can nuke and reclone. Don't run this on a machine with credentials, production data, or anything you care about losing.
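Stripped of the hardware pings, stuck detection, and build verification, the core loop is small. This is a simplified sketch, not the real autorun.sh; the agent command is parameterised so the loop can be exercised with a stub instead of the real `claude --dangerously-skip-permissions` invocation:

```shell
# Simplified autorun loop: count checkboxes, run one agent iteration,
# back off exponentially on failure. Not the real autorun.sh.
autorun_loop() {
    tasks=$1; agent=$2
    backoff=300                               # start at 5 minutes
    i=1
    while [ "$i" -le 100 ]; do                # max iterations
        remaining=$(grep -c '\[ \]' "$tasks" || true)
        if [ "$remaining" -eq 0 ]; then
            echo "ALL_TASKS_COMPLETE"
            return 0
        fi
        echo "Iteration $i: $(grep -c '\[x\]' "$tasks") done, $remaining remaining"
        if "$agent" "$tasks"; then
            backoff=300                       # success: reset the backoff
        else
            sleep "$backoff"                  # rate limited: wait and retry
            backoff=$((backoff * 2))
            [ "$backoff" -gt 3600 ] && backoff=3600   # cap at 60 minutes
        fi
        i=$((i + 1))
    done
    return 1
}

# Stub agent for the demo: completes exactly one unchecked task per call.
complete_one() {
    awk '!hit && sub(/\[ \]/, "[x]") { hit = 1 } { print }' "$1" > "$1.tmp" \
        && mv "$1.tmp" "$1"
}

printf -- '- [ ] T001 [US1] Frame and send messages\n- [ ] T002 [US1] Poll sockets\n' > demo-tasks.md
autorun_loop demo-tasks.md complete_one
# Prints:
#   Iteration 1: 0 done, 2 remaining
#   Iteration 2: 1 done, 1 remaining
#   ALL_TASKS_COMPLETE
```

The script's only model of the project is the checkbox counts; everything else lives in the artifacts Claude reads each iteration.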
Before each iteration, it pings the three Classic Macs by IP address and port and includes their status in the prompt:
The following Classic Mac machines were tested for connectivity:
- performa6400 (PPC/OT): ONLINE (FTP + LaunchAPPL)
- performa6200 (PPC/MacTCP): ONLINE (FTP + LaunchAPPL)
- macse (68k/MacTCP): ONLINE (LaunchAPPL only, no FTP)

Claude gets this at the start of each session. If a Mac is online, Claude can deploy binaries and run hardware tests as part of implementation. If a Mac is off, Claude skips the hardware tasks and moves on to other work. The prompt also includes a feedback-first rule: when a hardware test reveals unexpected behaviour, run /speckit.feedback before implementing a fix.
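The probe itself needs nothing beyond a TCP connect with a short timeout. A sketch, with placeholder TEST-NET addresses and ports rather than the real machines' (so the demo lines here will report OFFLINE):

```shell
# Sketch of the pre-iteration connectivity probe. Addresses and ports are
# invented placeholders (192.0.2.x is unroutable TEST-NET space).
check_mac() {
    name=$1; host=$2; port=$3; svc=$4
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "- $name: ONLINE ($svc)"
    else
        echo "- $name: OFFLINE (hardware tasks will be skipped)"
    fi
}

echo "The following Classic Mac machines were tested for connectivity:"
check_mac performa6400 192.0.2.64 21   "FTP + LaunchAPPL"
check_mac performa6200 192.0.2.62 21   "FTP + LaunchAPPL"
check_mac macse        192.0.2.68 1984 "LaunchAPPL only, no FTP"
```

Whatever this prints goes verbatim into the prompt, which is how Claude knows to skip hardware tasks for an offline machine.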
The auto-detect feature means I don't need to tell it which spec to work on. It scans specs/ for the highest-numbered directory with incomplete tasks and picks that one.
I kicked it off and went to bed:
```
$ ./tools/autorun.sh
[2026-02-28 23:59:31] ==========================================
[2026-02-28 23:59:31] Overnight Build — Starting
[2026-02-28 23:59:31] Project: /home/matt/Desktop/peertalk
[2026-02-28 23:59:31] Tasks file: specs/001-peertalk-sdk/tasks.md
[2026-02-28 23:59:31] Log dir: logs/overnight-20260228-235931
[2026-02-28 23:59:31] Max iterations: 100
[2026-02-28 23:59:31] ==========================================
[2026-02-28 23:59:31] Initial state: 0 done, 32 remaining
[2026-02-28 23:59:31] Current phase: ## Phase 1: Setup
[2026-02-28 23:59:31] No build directory yet — skipping initial build check.
[2026-02-28 23:59:31] ------------------------------------------
[2026-02-28 23:59:31] Iteration 1/100
[2026-02-28 23:59:31] Progress: 0/32 done, 32 remaining
[2026-02-28 23:59:31] Phase: ## Phase 1: Setup
[2026-02-28 23:59:31] Next: T001 Create project directory structure
[2026-02-28 23:59:31] Running Claude... (log: logs/overnight-20260228-235931/iter-001-235931.log)
```
The first iteration ran for about 27 minutes and completed 24 of 32 tasks. Then it hit the usage limit:
```
You're out of extra usage · resets 3am (Europe/London)
[2026-03-01 00:26:27] Rate limit or error (exit code: 1, consecutive: 1)
[2026-03-01 00:26:27] Backing off 5m (resuming ~00:31:27)...
```
The script backed off exponentially: 5 minutes, 10, 20, 40, then capped at 60. It kept retrying through the night. At 3:41am the usage reset, and the final iteration picked up where it left off:
```
[2026-03-01 03:41:47] Iteration 7/100
[2026-03-01 03:41:47] Progress: 24/32 done, 8 remaining
[2026-03-01 03:41:47] Phase: ## Phase 7: User Story 5 — Cross-Platform Communication
[2026-03-01 03:41:47] Running Claude... (log: logs/overnight-20260228-235931/iter-007-034147.log)
ALL_TASKS_COMPLETE
[2026-03-01 04:17:58] ==========================================
[2026-03-01 04:17:58] Overnight Build — Complete
[2026-03-01 04:17:58] ==========================================
[2026-03-01 04:17:58] Tasks completed this run: 32
[2026-03-01 04:17:58] Total done: 32
[2026-03-01 04:17:58] Remaining: 0
[2026-03-01 04:17:58] Iterations used: 7
[2026-03-01 04:17:58] SUCCESS: All tasks implemented.
[2026-03-01 04:17:58] ==========================================
```
I woke up to 32 completed tasks across eight phases. 3,867 lines of code, 20 public functions, C89 compliant, zero malloc after init. This wouldn't have worked with the original approach. You can't put 180 KB phase files into an unattended loop. The script doesn't understand the project. It counts checkboxes and builds a prompt. Claude reads the task list, sees what's done, and picks up the next one.
# The feedback loop
32/32 tasks done, but that's not the end. PeerTalk's testing process involves three Classic Macs on a LAN. Each machine gets a different binary (68k for the Mac SE, PPC/MacTCP for the Performa 6200, PPC/OT for the Performa 6400). The MCP server handles all of this: it deploys binaries via Retro68's LaunchAPPL, uploads files over FTP, and fetches application logs back for analysis. The plan can get the architecture right, but things like OT import libraries not matching their headers, or MaxApplZone() needing to be called before any Memory Manager operation — those only show up when you build for the actual hardware and run it.
I wanted a way to feed what I saw from test runs straight back into the spec artifacts, on the same branch. Working on one branch through multiple implement-test-feedback cycles until you reach a stable point felt better than branching for every fix. That's what /speckit.feedback does. You tell it what you found and it appends a research entry, an edge case in the spec, and a fix task to the existing task list. It's append-only — it never rewrites what the other commands wrote.
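The append-only contract is easy to honour mechanically: find the highest task ID, append a new task under the next one, and never touch existing lines. A sketch with invented task text:

```shell
# Sketch of the append-only feedback step: add a fix task with the next
# free ID without rewriting anything. Task text is illustrative.
printf -- '- [x] T031 [US5] Verify OT stream on the Performa 6400\n- [x] T032 [US5] Final cross-platform pass\n' > tasks.md

last=$(grep -o 'T[0-9][0-9][0-9]' tasks.md | sort | tail -1 | tr -d 'T')
next=$(printf 'T%03d' $((10#$last + 1)))    # 10# guards against octal parsing
echo "- [ ] $next Fix chunk reassembly to use the receiver's buffer sizes" >> tasks.md

tail -1 tasks.md
# → - [ ] T033 Fix chunk reassembly to use the receiver's buffer sizes
```

Because the new task is just another unchecked checkbox, the next autorun pass picks it up with the same grep it uses for everything else.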
The first round of feedback came from /speckit.analyze, which found two constitution violations in finished code that had all its tasks marked done: chunk reassembly used the sender's buffer sizes instead of the receiver's, and the memory budget was wrong on Classic Mac. The test apps crashed on every Mac too — written POSIX-first with platform stubs that compiled to no-ops. Each finding became a feedback invocation, and the task list went from 32/32 complete to 32/32 complete with fix tasks queued.
A second run picked up 16 remediation tasks. Testing that batch on three Macs found subtler issues: a TCP receive buffer too small for chunked messages, log filename collisions between test apps, missing Gestalt calls for machine identification in the logs. Seven more tasks from feedback. A third build dealt with those.
Spec Kit already keeps everything on one branch per feature, so autorun and feedback just work together naturally. autorun implements the tasks, I test on hardware, /speckit.feedback appends what I found, and autorun picks up the new tasks on its next run. Each cycle gets cheaper — the first build completed 32 tasks, the second handled 16, the third dealt with 7. Less new code each time, more verification against real hardware.
Right now the feedback step is manual. autorun.sh tells Claude to use /speckit.feedback when it hits something unexpected during a session, but the test-and-feed-back cycle between runs still involves me. The obvious next step is having autorun run the hardware tests itself after each implementation pass, parse the results, and invoke /speckit.feedback automatically before starting the next iteration. The MCP server already handles deployment and log retrieval, so the pieces are there. The loop just needs wiring up.
# Spec Kit on everything
The first thing I did with Spec Kit was split PeerTalk into two repos: clog (the logging library) and peertalk (the SDK). Once the SDK was working and I had such solid results from autorun, I used Spec Kit to build csend — a cross-platform chat app — from scratch. That took about an hour and a half.
clog is the logging library. In v1, logging was baked into PeerTalk as PT_Log with its own phase plan (PHASE-0-LOGGING.md, 113 KB, 3,261 lines). Now it's a separate repo: 138 lines for POSIX, 162 lines for Classic Mac, and a 125-line header. Six functions, four macros. It has its own constitution with nine principles, including "no dynamic allocation" and "under 500 lines". Total: 729 lines across all files — the old phase plan alone was more than four times the size of everything in the clog repo!
peertalk is the SDK. Four core files (pt_core.c, pt_discovery.c, pt_messaging.c, pt_memory.c) plus three platform backends (POSIX, MacTCP, Open Transport). 5,922 lines of source. The old repo had 62,778 lines of C across six platform directories, because it also had an AppleTalk backend, a logging library, and things like queue.c (964 lines for the two-tier priority queue system) and send.c (the multi-layer send path). None of that exists in the new version because the constitution said none of the three target apps need it.
Each repo has the same structure: a .specify/ directory with constitution, templates, and scripts; a specs/ directory with numbered feature specifications; a tools/autorun.sh for unattended builds; and GitHub Actions CI that cross-compiles for all platforms. All three follow the same workflow.
csend is the chat app and the first real application built on the new SDK. The old peertalk repo planned two example chat apps but only chat_posix.c got built (559 lines, POSIX-only). The Classic Mac version never happened because the SDK kept growing. csend has both: a POSIX terminal interface and a Classic Mac GUI with a message window, peer list, and input field. 4,404 lines, five build targets, its own constitution. The entire networking layer is two bridge files (269 lines on POSIX, 142 on Classic Mac) that register PeerTalk callbacks and call PT_Poll from the main loop. I had it working in about an hour and a half with autorun — kept the GUIs for both POSIX and Classic Mac but rewired them to use the PeerTalk SDK.
All three repos have GitHub Actions CI that cross-compiles for every platform on every push, using a Retro68 Docker container with the full toolchain pre-installed. A push to csend triggers five builds, each pulling both clog and peertalk as upstream dependencies and cross-compiling everything from source. The old repo had four workflows but only one cross-compilation job. Static analysis via cppcheck runs on every push too — dedicated cppcheck specs in peertalk and csend were just "fix cppcheck warnings," and autorun handled them overnight.
# What the numbers show
| | Old PeerTalk (v1) | clog | peertalk | csend | Total (new) |
|---|---|---|---|---|---|
| Files | 397 | 44 | 86 | 80 | 210 |
| C/H source lines | 62,778 | 729 | 5,922 | 4,404 | 11,055 |
| Public header | 1,499 lines (77 functions) | 125 lines | 165 lines (20 functions) | n/a | 290 lines |
| Phase plans | 14 (37,298 lines) | 0 | 0 | 0 | 0 |
| Spec artifacts | 0 | ~1,200 lines | ~5,300 lines | ~380 lines | ~6,900 lines |
| Claude Code skills | 16 | 0 | 0 | 0 | 0 |
| Hooks | 4 | 0 | 0 | 0 | 0 |
| CI jobs | 4 | 4 | 5 | 5 | 14 |
| Build targets | 3 | 3 | 4 | 5 | 12 |
| Constitution | none | 8 principles | 10 principles | 10 principles | 3 constitutions |
The old repo had nearly six times more C code than all three new repos put together. 57 test source files became 10 test files that actually run on real hardware. 77 public functions became 20. 16 custom skills became zero, replaced by Spec Kit's built-in commands. And 14 phase plans totalling 37,298 lines became about 6,900 lines of spec artifacts spread across the three repos.
The skills went away because Spec Kit does the same things. /session used to navigate phase plans — now the task format uses [x] markers and navigation is a grep command. /implement used to spin up four subagents to gather context — now the artifacts are small enough that Claude reads them in full. /review ran six analysis passes — now /speckit.analyze does that against the constitution.
I kept the MCP server for Classic Mac hardware access, the domain-specific rules for MacTCP and Open Transport (they moved into research.md entries), and the general pattern of automated testing on real hardware. Those are project-specific capabilities, not planning tools, and they work alongside any planning approach.
# How I'll work from here
I didn't plan this workflow in advance. It came out of solving specific problems: the overnight build needed hardware checks (so autorun got machine connectivity), the analyze pass found constitution violations (so I built /speckit.feedback), the test apps crashed on Classic Mac (so research.md got entries about Toolbox initialisation), cppcheck found warnings (so spec 005 went through autorun too).
Going forward, any project with real complexity gets this workflow:
1. Write a constitution first. Say what the project does, what it doesn't do, and when it's done. Keep it short enough to read in 30 seconds.
2. Run the pipeline for each feature: `/speckit.specify`, `/speckit.clarify`, `/speckit.plan`, `/speckit.tasks`. Let the gating do its job. Don't skip steps.
3. Run `/speckit.implement`. If you trust the foundation, wrap it in autorun and walk away.
4. Test the output. Read the logs. If something failed, find out why before trying again.
5. Feed findings back in with `/speckit.feedback`. It appends to the existing artifacts rather than rewriting them. Then run implement again.
6. Push to CI. Let static analysis and cross-compilation catch what local builds miss. If it flags something, make a new spec and run it through.
You can still just talk to Claude and tell it to write things up using Spec Kit. I'd point Claude at a log file from a real Mac test run, describe what went wrong, and tell it to write it up using /speckit.feedback. Claude would format the findings as a research entry, an edge case, and a fix task, all in the right format for autorun to pick up later. Same thing works for new features — describe what you want in plain English and tell Claude to run the pipeline. The commands are there to enforce structure, but you can reach them through normal conversation too.
The next two projects are the other target apps from the constitution: a Bomberman clone and a human-vs-human chess game, both playable over LAN between Classic Macs and Linux. Same SDK, same workflow, same autorun. The networking is done and tested on every target machine, so these are pure application builds with proper guardrails from the start. I am really looking forward to it.