A Code Review of the Rapid Prototype

This one took a while. I reviewed the prototype code from my previous post, using its metrics dashboard - a CI pipeline of 15 jobs tracking coverage, complexity, and security - as the starting point. Claude helped with the review: working through git history, explaining TypeScript patterns, and checking my analysis.

In the previous post, I described using Claude Code to build a recipe app in four days. The methodology included a CI/CD pipeline with 15 jobs tracking coverage, complexity, duplication, and security. Every metric passed. The dashboard was green.

This post examines metrics that looked pretty good on paper but missed structural problems that only showed up when I reviewed the code myself. If you want the short version, skip to the summary.

#The Dashboard

At commit cfaafa8, the metrics dashboard reported:

| Metric | Value | Rating |
|---|---|---|
| Backend Test Coverage | 47% | B |
| Frontend Test Coverage | 15.71% | D |
| Backend Complexity (CC) | 2.79 avg | A |
| Backend Maintainability | 86.14 | A |
| Frontend Complexity | 17 warnings | C |
| Frontend Duplication | 1.54% | A |
| Backend Duplication | 4.14% | B |
| Frontend Security | 0 vulns | A |
| Backend Security | 0 vulns | A |
| Bundle Size | 1,807 KB | C |
| Legacy Lint | 17 warnings | C |
| Legacy Duplication | 6.67% | B |
| Python Lint (Ruff) | 0 issues | A |

Dashboard metrics at commit cfaafa8 showing 13 tracked code quality indicators.

These ratings come from thresholds in the CI dashboard - mostly tool defaults (pytest-cov, radon, jscpd), with arbitrary cutoffs where tools don't provide them. I'll use them as a starting point to review the actual code.

#Metric by Metric Analysis

#1. Backend Test Coverage: 47%

Measures: Percentage of Python backend code executed during pytest test runs. Covers Django models, API endpoints, services, and middleware across the Django/Ninja API.

What the dashboard says: 47% of backend code is covered by tests.

The reality: The 47% is an average. Core APIs have 93-100% coverage; AI services have 10-25%. The scraper depends on websites that change their HTML structure, and the AI services depend on LLMs that return unexpected responses. When the scraper fails, the error is obvious - parse failures, empty results. When the AI fails, the error is hidden - valid JSON with wrong content, hallucinated data, subtle schema violations. The code most likely to fail silently is the code with the least test coverage.

| Module | Coverage | What it contains |
|---|---|---|
| apps/recipes/api.py | 93% | Standard CRUD endpoints |
| apps/recipes/api_user.py | 99% | User-specific endpoints |
| apps/core/middleware.py | 100% | Request middleware |
| apps/ai/services/* | 10-25% | AI integration, validation, scaling |

Backend coverage by module type at commit cfaafa8.

Coverage report filtered to the services directories: the AI services average 17% coverage (10-25% per module), while recipe services average 87% (62-95%).

Why AI services have low coverage: The planning documents show the difference. PHASE-2-RECIPE-CORE.md built verification into every session: "Verify: pytest tests/test_models.py passes" appeared after each task. Tests were completion criteria, not optional extras.

PHASE-8B-AI-FEATURES.md took a different approach. The session scope table listed features without verification: "Recipe remix: Backend API + React UI → Complete." Tests were scheduled as a separate task (8B.10) for Session F. By the time tests were written, the implementation context had moved on, and the tests became a formality rather than a proper check on the code.

Test quality: The scraping code has good test design. The API tests mock the scraper to test HTTP handling, while test_scraper.py uses a mock_html_response fixture with real JSON-LD markup - mocking the network but testing actual parsing logic. test_user_features.py uses zero mocks, testing 34 real database scenarios for profile isolation.

The AI services mock at too high a level. The tests mock OpenRouterService entirely, so validation logic like _validate_value (cyclomatic complexity 40, 10% coverage) never runs against realistic AI responses. The fix: create mock_ai_response fixtures with realistic LLM outputs, like the scraper tests do with HTML.
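A minimal sketch of that fixture approach - the fixture name, payload shapes, and edge cases here are illustrative, not the project's actual test code:

```python
import json

import pytest

# Hypothetical LLM payloads, mirroring how mock_html_response embeds real
# JSON-LD markup for the scraper: realistic outputs plus the edge cases
# that validation code must handle (missing fields, truncated JSON).
AI_RESPONSES = {
    "valid": json.dumps({"title": "Lentil Soup", "servings": 4}),
    "missing_field": json.dumps({"title": "Lentil Soup"}),       # no servings
    "malformed_json": '{"title": "Lentil Soup", "servings": 4',  # truncated
}

@pytest.fixture(params=list(AI_RESPONSES))
def mock_ai_response(request):
    """Each test using this fixture runs once per payload variant."""
    return AI_RESPONSES[request.param]
```

Tests that patch only the HTTP call and feed these payloads into the real validation path would exercise the validation logic instead of mocking it away.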

What to improve:

Code fixes (would implement using planning mode with the implementable skill to verify the plan):

  1. Create AI response fixtures with realistic LLM outputs - like mock_html_response for the scraper (which uses real JSON-LD markup), create mock_ai_response fixtures with realistic AI responses including edge cases: missing fields, wrong types, malformed JSON, hallucinated data. This lets validation logic run against realistic inputs instead of being mocked away entirely.

  2. Refactor _validate_value before testing it - The validator.py module has 10% coverage. The _validate_value function (lines 163-260, CC=40) handles union types, recursive object validation, array validation, and type checking - but never runs in tests because the AI service is mocked entirely. Just adding tests would improve the coverage number but wouldn't make the code better. A function with CC=40 is difficult to test properly anyway. The real fix is in Backend Complexity #1: replace the hand-rolled validation with jsonschema, which reduces the function to a single library call. Then test that.

  3. Write tests for other AI services with low coverage - Using the same realistic fixtures approach:

    • scaling.py (241 lines) - ingredient scaling with unit conversion
    • remix.py (311 lines) - recipe variation generation
    • discover.py (278 lines) - recipe suggestions from preferences
    • ranking.py (162 lines) - scoring and ranking logic

Process fix:

  • Include verification in each session - The AI services had tests deferred to a separate session (8B.10), which led to tests that mock at too high a level. When tests are written alongside implementation, the context for edge cases and realistic inputs is fresh. The updated implementable skill now flags deferred testing as a common problem.

#2. Frontend Test Coverage: 15.71%

Measures: Percentage of React/TypeScript frontend code executed during Vitest test runs using React Testing Library. Covers components, hooks, screens, and utilities. Tests verify structure and behaviour (text rendering, click handlers, state updates) but not visual appearance.

What the dashboard says: 15.71% of frontend code is covered by tests.

The reality: I'm not a TypeScript expert by any means, but I can see what bad looks like: a 337-line god component managing 10 screens with manual state-based navigation, or unsafe JSON parsing with no error handling. This section focuses purely on what the coverage metrics show - which code has tests and which doesn't. The Frontend Complexity section uses Claude Code to look deeper at the actual code quality and architectural issues.

The frontend coverage report shows coverage ranges from 61% for the API client down to 0% for reusable components:

| Directory | Coverage | Lines | What it contains |
|---|---|---|---|
| src/api/ | 61% | 612 | API client with fetch |
| src/hooks/ | 53% | 408 | useTimers, useWakeLock |
| src/lib/ | 16% | 166 | Utility functions |
| src/screens/ | 11% | 3,837 | 10 screen components |
| src/components/ | 0% | 919 | 6 reusable UI components |

Frontend directory coverage breakdown showing logic tested but UI components untested.

Frontend coverage report by directory. The components directory has 0% coverage while the API client and hooks have 53-61% coverage.

The pattern: logic (API client, hooks) is tested. UI (screens, components) is not. A test file named components.test.tsx actually tests screens:

```tsx
import ProfileSelector from '../screens/ProfileSelector'
import Search from '../screens/Search'

describe('ProfileSelector', () => { ... })  // 6 tests
describe('Search', () => { ... })           // 10 tests
```

The 11% screens coverage is misleading. It's really 2 screens tested, 8 screens untested:

| Screen | Coverage | Lines |
|---|---|---|
| ProfileSelector.tsx | 83% | 193 |
| Search.tsx | 80% | 306 |
| Settings.tsx | 0% | 1,314 |
| RecipeDetail.tsx | 0% | 739 |
| Home.tsx | 0% | 447 |
| Collections.tsx | 0% | 189 |
| PlayMode.tsx | 0% | 164 |
| CollectionDetail.tsx | 0% | 148 |
| Favorites.tsx | 0% | 75 |
| AllRecipes.tsx | 0% | 70 |

Screen coverage showing 2 tested screens and 8 untested screens including the largest files.

The three largest files - Settings.tsx (1,314 lines), RecipeDetail.tsx (739 lines), and Home.tsx (447 lines) - have 0% coverage.

The actual components in src/components/ have 0% coverage:

| Component | Lines | Used in | Tests |
|---|---|---|---|
| RecipeCard.tsx | 112 | 4 screens | 0 |
| TimerPanel.tsx | 167 | PlayMode | 0 |
| RemixModal.tsx | 201 | RecipeDetail | 0 |
| Skeletons.tsx | 201 | Multiple | 0 |
| AddToCollectionDropdown.tsx | 145 | RecipeDetail | 0 |
| TimerWidget.tsx | 93 | TimerPanel | 0 |

Components in src/components/ showing usage and test coverage.

Why components have 0% coverage: The planning documents show the gap. PHASE-4-REACT-FOUNDATION.md specified task 4.9: "Write tests for profile and search API integration". PHASE-6-REACT-RECIPE-PLAYMODE.md specified task 6.6: "Write tests for timer logic and collections API".

The pattern: every testing task specified testing logic (hooks, API clients) or integration (API calls). No task said "test the RecipeCard component" or "test TimerPanel rendering". The AI tested exactly what the plan asked for.

Contrast with what worked: timers.test.ts has 23 tests for the useTimers hook because the plan specified "test timer logic". The hook has 86% coverage while the component that uses it (TimerPanel) has 0%.

Hooks coverage breakdown. useTimers.ts has 86% coverage because the plan said "test timer logic". useWakeLock.ts has 0% because it wasn't mentioned.

Why RecipeCard matters: RecipeCard is imported in 4 screens (Home, Favorites, AllRecipes, CollectionDetail). It handles image loading, favourite toggling, time formatting, and rating display. A bug in RecipeCard would break the primary browsing experience across the app, yet there's no test verifying it renders correctly.

Test quality: The tests that do exist verify real user interactions, not just "renders without crashing":

  • ProfileSelector tests (6 tests): Profile loading, selection callbacks, form submission, API integration
  • Search tests (10 tests): Source filtering, pagination, URL import flow, navigation
  • Timer tests (23 tests): Time parsing from recipe text, countdown mechanics, completion callbacks

There just aren't enough of them. The plan specified "API integration" and "timer logic", so those got tested well. It never mentioned component testing, so 6 components and 8 screens have zero coverage.

What to improve:

Looking beyond the coverage numbers (see Frontend Complexity for the detailed code review), the low coverage reveals architectural issues that make testing difficult:

  • App.tsx god component (337 lines, 10 state variables, manages 10 screens) - can't test screens in isolation
  • Settings.tsx state explosion (30 useState hooks) - complex state makes test setup difficult
  • Prop drilling - RecipeCard receives callbacks from 4 different screens, making unit tests require complex mocking
  • No routing library - manual state-based navigation couples screens to App.tsx

The god component pattern makes testing screens difficult because App.tsx manages all their state. Testing a screen in isolation would require mocking the entire App.tsx state tree. The 6 components in src/components/ are already isolated and can be tested without refactoring. The screens would need to be decoupled from App.tsx first.

Code fixes (priority order):

The same top 5 fixes from the Frontend Complexity section address these issues:

  1. Fix useTimers race condition
  2. Group Settings.tsx state with useReducer (30 useState → organized reducer state)
  3. Move data loading to onClick handlers (removes complex useEffect patterns)
  4. Extract formatTime to shared utility (removes duplication in RecipeCard and RecipeDetail)
  5. Wrap favoriteRecipeIds in useMemo (avoids unnecessary work on every render)

Additionally for coverage:

  1. Test all 6 components - The components directory has 6 small, isolated files (93-201 lines each). These are the easiest wins - no refactoring needed, just add test files. RecipeCard is the priority: it's imported in 4 screens (Home, Favorites, AllRecipes, CollectionDetail) and handles image loading, favourite toggling, and time formatting. A bug there breaks the primary browsing experience. Test the onFavoriteToggle callback, image fallback when recipe.image is null, and the inline formatTime function.

  2. Refactor Settings.tsx, then test the pieces - The 1,314-line file renders 6 tabs via inline conditionals. Extract each tab to a separate component, then test:

    • SettingsGeneral.tsx - API key validation, key persistence, status indicators
    • SettingsPrompts.tsx - Prompt editing/saving, model dropdown, active/disabled toggling
    • SettingsSources.tsx - Individual and bulk source toggles, state persistence
    • SettingsSelectors.tsx - CSS selector editing, individual/batch source testing
    • SettingsUsers.tsx - Profile list rendering, deletion flow with preview modal
    • SettingsDanger.tsx - Two-step reset confirmation, text input validation

Process fix:

  • Specify component testing in plans - instead of "build RecipeCard", write "build RecipeCard with tests for: renders without image, onFavoriteToggle callback fires, total_time formats correctly". Make component tests part of the definition of done, not a separate task. See the updated implementable skill for how this is now flagged during plan evaluation.

#3. Backend Complexity: A Rating (2.79 Average)

Measures: Cyclomatic complexity of Python functions using Radon. Counts the number of independent paths through code - more branches and loops mean higher complexity. Lower scores indicate simpler, more maintainable functions.

What the dashboard says: Average cyclomatic complexity of 2.79 earns an A rating.

The reality:

| Rating | Function Count | % | What it means |
|---|---|---|---|
| A (1-5) | 408 | 88% | Trivial: getters, one-liners, simple CRUD |
| B (6-10) | 45 | 10% | Moderate |
| C (11-20) | 12 | 3% | Complex - needs attention |
| D (21-30) | 0 | 0% | - |
| E (31+) | 1 | <1% | _validate_value at CC=40 |

Backend complexity ratings: 408 functions score A (simple), 12 score C (complex), and 1 scores E (very complex). The average is 2.79, which rates as A.

The A rating is mathematically correct - 408 simple functions pull the average down. But 13 functions still need attention. The _validate_value function has a cyclomatic complexity of 40, making it difficult to test and risky to modify.

Where the complexity lives: Five of the 12 C-rated functions are in apps/ai/services/, with another 4 in apps/recipes/services/. The AI services have 10-25% test coverage. The most complex code is the least tested.

Why it happened: No complexity limits in the plan. The AI was told to make validation work, not to keep it maintainable.

What to improve:

Code fixes:

  1. Replace hand-rolled validation with a library - The _validate_value function (CC=40) implements JSON Schema validation manually. The function has 98 lines with 20+ conditionals handling union types (lines 177-198), recursive object validation (lines 201-217), array validation (lines 219-242), and type checking. The schemas (lines 15-105) use JSON Schema dictionary notation. A library like jsonschema would replace the entire _validate_value function with a single jsonschema.validate(value, schema) call.

  2. Extract duplicate JSON parsing logic - complete and complete_async (both CC=12) contain identical 15-line blocks for extracting JSON from markdown code blocks. Extract this to a shared _parse_json_response(content: str) -> dict helper function.

  3. Refactor HTML parsing monolith - _extract_result_from_element (CC=20) does 6 sequential extraction tasks in 78 lines: find link, validate URL, extract title (4 fallbacks), parse rating from title with regex, extract image (3 fallbacks), extract description. Each task should be its own method: _find_link(), _extract_title(), _extract_rating(), _extract_image(), _extract_description(). This would reduce the main function to 6 one-line method calls.

Process fix:

  • Add complexity limits to CLAUDE.md - Include a rule like: "Max cyclomatic complexity per function: 15. If exceeded, refactor before completing." This catches the problem at planning time rather than during later review. The updated implementable skill now flags missing code quality gates as a common problem.

#4. Frontend Complexity: 17 Warnings (C Rating)

Measures: React/TypeScript frontend code quality using ESLint with the project's standard linting rules. This section analyses the "Frontend Lint" metric (17 warnings), not the separate "Frontend Complexity" metric (9 warnings from complexity-specific rules like max cyclomatic complexity and nesting depth).

What the dashboard says: The metrics dashboard shows 17 ESLint warnings in the Frontend Lint metric. All 17 are the same type: react-hooks/exhaustive-deps. That pattern alone suggests something structural rather than random errors.

The reality: As I mentioned earlier, I'm not a TypeScript expert. I can see obvious problems like Settings.tsx being 1,314 lines long and all of it inside one function. You don't need to know React to know that's not right. But I used Claude Code (Sonnet 4.5) to help me review the code, and it found things I wouldn't have caught - like a race condition in the timer hook that starts intervals before state is ready. The metrics pointed me in the right direction - when all 17 warnings are the same type, that's a pattern worth investigating. Claude helped me understand what I was looking at in the code, like this from Home.tsx:

```tsx
useEffect(() => {
  // checks 5 state variables...
  if (activeTab === 'discover' && discoverSuggestions.length === 0 &&
      !discoverLoading && !discoverError && aiAvailable) {
    loadDiscoverSuggestions()
  }
}, [activeTab, aiAvailable]) // ...but only re-runs when 2 of them change
```

Even without React expertise, the problem is visible once it's explained: the effect checks five state variables but only re-runs when two of them change. If discoverLoading goes from true to false, the effect won't re-run to check if it should now load suggestions. Claude walked me through why this is wrong and what it means.

The full code review found that the ESLint warnings pointed to architectural issues: the god component (App.tsx manages 10 screens), prop drilling, race conditions, and state management problems. The review is 674 lines covering architecture, state management, error handling, TypeScript usage, code duplication, performance, and specific issues in every file - all explained in plain terms I could understand and verify against the actual code.

Why it happened: The plan said "build screens". It didn't specify navigation approach (routing library vs manual state management), didn't require fixing lint warnings, and ESLint runs in CI but exits successfully with warnings. The 17 warnings never blocked merging, so they accumulated.

What to improve:

Top 5 code fixes from the review:

  1. Fix useTimers race condition - useTimers.ts starts the interval before adding the timer to state. The first tick tries to update a timer that doesn't exist yet. Move the interval setup inside the setTimers callback so the timer exists before the first tick.

  2. Group Settings.tsx state with useReducer - Settings.tsx has 30 separate useState hooks. Related state for the reset flow (showResetModal, resetPreview, resetStep, resetConfirmText, resetting) should be grouped into a single reducer state; the same goes for the deletion flow, prompt editing, and source testing. A reducer makes it impossible to update only part of the related state and leave the UI inconsistent.

  3. Move data loading to onClick handlers - Home.tsx has useEffect with five guard conditions to prevent infinite loops. This is a code smell. If you need guards to prevent a useEffect from running, the logic is in the wrong place. Move to button click:

    ```tsx
    const handleDiscoverClick = () => {
      setActiveTab('discover')
      if (discoverSuggestions.length === 0 && !discoverLoading && aiAvailable) {
        loadDiscoverSuggestions()
      }
    }
    ```

  4. Extract formatTime to shared utility - The same time formatting logic appears in RecipeCard.tsx and RecipeDetail.tsx. Both components define their own formatTime function with identical code. Extract to lib/formatting.ts so changes don't need to be made twice.

  5. Wrap favoriteRecipeIds in useMemo - Home.tsx creates a new Set on every render:

    ```tsx
    const favoriteRecipeIds = new Set(favorites.map((f) => f.recipe.id))
    ```

    This runs even when favorites hasn't changed. Wrap in useMemo(() => new Set(...), [favorites]) to avoid unnecessary work.

Process fixes:

  • Treat linter warnings as failures in CI - Updated the implementable skill to flag CI configurations that allow warnings to pass.

  • Require navigation approach for multi-screen applications - Plans creating multi-screen applications should specify navigation/routing architecture. For SPAs (React, Vue, etc.), specify routing library or manual state management. For mobile apps (React Native, Flutter), specify navigation library. Server-rendered frameworks (CakePHP, Rails, Django) have built-in routing. Manual navigation in SPAs tends toward god components managing app-wide state. Updated the implementable skill to flag multi-screen applications that don't address navigation architecture.

#5. Backend Duplication: 4.14% (B Rating)

Measures: Percentage of duplicated code in Python backend using jscpd. Detects copy-pasted code blocks that could be refactored into shared functions or base classes.

What the dashboard says: 4.14% code duplication in Python, B rating.

The reality: The duplication report shows 21 clones across 36 files with 283 duplicated lines. The top offenders:

The AI API error handling pattern appears 7 times with identical structure:

```python
except AIUnavailableError as e:
    return 503, {
        'error': 'ai_unavailable',
        'message': str(e) or 'AI features are not available. Please configure your API key in Settings.',
        'action': 'configure_key',
    }
except (AIResponseError, ValidationError) as e:
    return 400, {
        'error': 'ai_error',
        'message': str(e),
    }
```

This exact 11-line block repeats across seven endpoints: remix_suggestions, create_remix_endpoint, scale_recipe_endpoint, tips_endpoint, timer_name_endpoint, discover_endpoint, and repair_selector_endpoint.

Impact: The 4.14% duplication (B rating) is already flagging this as worth reviewing. The concentration in specific files makes it easier to fix - seven AI endpoints share the same error handling logic, so refactoring to a decorator would eliminate all seven copies at once. If the app needed internationalisation, the error message "AI features are not available. Please configure your API key in Settings." would need translating in seven separate locations instead of one.

Why it happened: No refactoring step in the plan. Once endpoints worked, the session was complete.

What to improve:

The full backend code review looks at where the duplication occurs and what improvements make sense. Most of the fixes overlap with the complexity issues already covered in Backend Complexity:

Additional fixes specific to duplication:

  1. Extract AI error handling to decorator - The same 11-line error handling block appears in 7 AI endpoints (ai/api.py lines 295-305, 365-375, 477-487, 534-544, 584-594, 630-640, 706-716). Create a @handle_ai_errors decorator to centralise error responses. This makes internationalisation easier - error messages would be defined once instead of in 7 locations.

  2. Fix AI service test mocking - apps/ai/tests.py mocks OpenRouterService entirely, so validation logic never runs against realistic AI responses. Create realistic_ai_response fixtures with actual JSON structures including edge cases (missing fields, wrong types, malformed data). But note that validator.py should be refactored first (see Backend Complexity #1) - no point testing a CC=40 function that's about to be replaced with a library call.
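A sketch of the decorator from fix #1. The exception names mirror the duplicated block quoted earlier, but the stub classes and endpoint are stand-ins, and the exact Django Ninja wiring is an assumption:

```python
import functools

# Stand-ins for the project's real exception classes.
class AIUnavailableError(Exception): pass
class AIResponseError(Exception): pass
class ValidationError(Exception): pass

def handle_ai_errors(func):
    """Wrap an AI endpoint so all seven share one copy of the error handling."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except AIUnavailableError as e:
            return 503, {
                "error": "ai_unavailable",
                "message": str(e) or ("AI features are not available. "
                                      "Please configure your API key in Settings."),
                "action": "configure_key",
            }
        except (AIResponseError, ValidationError) as e:
            return 400, {"error": "ai_error", "message": str(e)}
    return wrapper

@handle_ai_errors
def scale_recipe_endpoint(request):
    # Illustrative endpoint body: any AI failure is handled by the decorator.
    raise AIUnavailableError()
```

With this in place each endpoint keeps only its happy path, and the user-facing message lives in exactly one location for internationalisation.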

Process fix:

  • Include refactoring step after implementing similar features - Updated the implementable skill to flag plans that create multiple similar components (5+ API endpoints, view handlers, or service methods) without including a task to extract shared patterns. After implementing features that follow similar structures, plans should include a refactoring task to identify and extract common code into helpers, decorators, or base classes.

#6. Frontend Duplication: 1.54%

Measures: Percentage of duplicated code in React/TypeScript frontend using jscpd (copy-paste detector). Scans for identical or near-identical code blocks that could be refactored into shared functions or components.

What the dashboard says: Only 1.54% of frontend code is duplicated. A rating.

The reality:

| File | Lines | What's inside |
|---|---|---|
| Settings.tsx | 1,314 | 6 tabs as inline conditionals, 32 hooks, 29 functions |
| RecipeDetail.tsx | 739 | 4 tabs inline, polling logic, scaling logic |
| All 6 components combined | 919 | RecipeCard, Skeletons, TimerPanel, etc. |

Largest frontend files showing two screens contain more code than all components combined.

Code structure: Two screen files contain more code than the entire components directory. jscpd counts repeated patterns across files, so when all the code is crammed into monolithic files there are no shared patterns to detect and the score stays low. Evidence: RecipeCard is imported in 4 screens (actual reuse). The 6 settings tabs? Zero reuse - they're all inline in one 1,314-line file.

Why it happened: The plan said "build the settings screen with tabs". It didn't say "build settings as composable tab components" or "no single file over 300 lines".

What to improve:

Code fixes:

The monolithic files identified here are the same ones that need refactoring for test coverage. See Frontend Test Coverage for the detailed refactoring plan:

  1. Extract Settings.tsx tabs into components - Detailed in Frontend Coverage fix #2. Breaking the 1,314-line file into 6 tab components enables both testing and reuse.

  2. Extract RecipeDetail tabs into components - RecipeDetail.tsx renders 4 tabs inline in a 739-line file. Extract to separate components (RecipeIngredients, RecipeInstructions, RecipeNutrition, RecipeTips). This reduces file size by 80% and creates testable, reusable components.

Process fix:

  • Flag monolithic files in plans - the updated implementable skill now flags large files (300+ lines) that inline logic instead of extracting to separate modules. Low duplication scores can be misleading - if all the code is in one massive file, there's nothing to duplicate.

#7. Legacy Duplication: 6.67% (B Rating)

Measures: Percentage of duplicated code in the legacy ES5 JavaScript frontend using jscpd. Higher than modern frontend due to lack of shared utilities infrastructure.

What the dashboard says: 6.67% duplication in legacy JavaScript.

The reality: The full code review found that the 6.67% duplication is conservative. It only measures text duplication, not architectural duplication. The legacy codebase is 4,623 lines of ES5 JavaScript split across 9 page modules, with the same god component pattern as the modern frontend:

| File | Lines | Purpose |
|---|---|---|
| detail.js | 1,275 | Recipe detail, favorites, collections, AI scaling, remix, tips |
| settings.js | 1,100 | 6 tabs (API keys, prompts, sources, selectors, profiles, reset) |
| search.js | 603 | Search screen |
| play.js | 587 | Play mode screen |

Legacy JavaScript file sizes showing two god components (detail.js and settings.js) that mirror the modern frontend's Settings.tsx (1,314 lines) and RecipeDetail.tsx (739 lines).


Why duplication is higher than modern frontend (1.54%):

The codebase already uses a Cookie namespace for shared functionality (Cookie.ajax, Cookie.toast, Cookie.state). But no shared utilities were created for common functions. Each page file reimplements escapeHtml(), handleTabClick(), and time formatting because there's no Cookie.utils to import them from. Without ES6 modules, the only option is to add utilities to the global Cookie object and load them via script tags before the page modules.

Why it happened: The plan said "build legacy frontend for iOS 9 support". It created the Cookie namespace for AJAX and state management but didn't specify which utility functions should be shared. Each page module was built independently, reimplementing common functions as needed.

What to improve:

Code fixes:

The full code review identifies specific patterns to fix:

  1. Add Cookie.utils for shared functions - Extract escapeHtml(), handleTabClick(), and formatTime() into a new Cookie.utils object in apps/legacy/static/legacy/js/utils.js. Load via script tag before page modules. This eliminates 5 copies of escapeHtml() and 3 copies of handleTabClick(). The infrastructure already exists (the Cookie namespace), it just needs utilities added to it.

  2. Split settings.js into tab-specific files - The 1,100-line file manages 6 tabs inline. Split into separate files (settings-general.js, settings-prompts.js, etc.) and load them on demand. This requires adding a simple script loader to the Cookie namespace since ES5 has no native import mechanism.

  3. Replace event listener loops with delegation - 16+ loops in settings.js attach individual listeners. Replace with single delegated listener checking event.target.dataset.action. Reduces listener count from 100+ to 1.

Process fix:

  • Define shared utilities upfront - when creating a global namespace object (Cookie, App, etc.), specify which utility functions belong there before building page modules. For this codebase: "Create Cookie.utils with escapeHtml, formatTime, handleTabClick. All page modules must use Cookie.utils for these functions, not reimplement them." The updated implementable skill now flags plans that create 4+ similar modules without defining shared utilities first.

#8. Security Vulnerabilities: 0

Measures: Known vulnerabilities in project dependencies using npm audit (frontend packages) and pip-audit (backend packages). Scans for published CVEs in installed libraries.

What the dashboard says: Zero vulnerabilities detected.

The reality: The CI pipeline runs two security scans:

  • npm audit - Frontend packages in CI: working, 0 vulnerabilities
  • pip-audit - Backend packages in CI: working, 0 vulnerabilities

But pip-audit isn't available locally. The requirements.txt doesn't include it, and the Dockerfile only installs from requirements.txt. The CI job installs pip-audit separately before running the scan.

```console
$ docker compose exec -T web pip-audit
exec: "pip-audit": executable file not found in $PATH
```

Scanner gaps: pip-audit works in CI but can't be run during local development. Beyond that, the security scanning only covers dependencies (npm audit for the modern frontend, pip-audit for the backend):

  • No SAST for code-level issues in any codebase - the backend has no Bandit scanning for Python code.
  • Both JavaScript frontends lack eslint-plugin-security. The modern frontend gets basic security rules (no-eval) from the ESLint recommended config, but has no XSS detection for innerHTML, dangerouslySetInnerHTML, or document.write.
  • The legacy code has ESLint with explicit security rules (no-eval, no-implied-eval) but runs with || true, so warnings don't fail the build. That's 4,623 lines of ES5 JavaScript handling user input with warnings-only enforcement.
  • No secrets detection (gitleaks, detect-secrets) runs anywhere.

Why it happened: My planning with Claude for the CI tooling was not thorough. The original plan didn't include security scanning at all - it was added later as an afterthought, and I never specified what scanning should exist or how to verify it worked locally. A vague plan led to a half-finished implementation.

What to improve:

Code fixes:

  1. Add pip-audit to requirements.txt - The CI installs it separately with pip install pip-audit, but it's missing from requirements.txt. Add it so local development can run the same security checks: docker compose exec -T web pip-audit should work, not error with "executable file not found".

  2. Add SAST scanning for Python - Install and configure Bandit for Python static analysis. Add to CI pipeline to catch code-level security issues like hardcoded credentials, SQL injection patterns, unsafe pickle usage, and insecure random number generation.

  3. Add SAST scanning for JavaScript - Install eslint-plugin-security for both frontends to catch XSS vulnerabilities (innerHTML, dangerouslySetInnerHTML, document.write()) and regex DoS patterns. The modern React frontend has basic security rules (no-eval from ESLint recommended) but lacks XSS detection. The legacy ES5 code has explicit rules (no-eval, no-implied-eval) but runs with || true so warnings don't fail the build. Add the plugin to both configs and remove || true from legacy CI.

  4. Add secrets detection - Use detect-secrets or gitleaks to prevent committing credentials, API keys, or tokens. Run as a pre-commit hook and in CI to catch accidental commits before they reach the repository.
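Until one of those tools is wired in, the idea behind secrets detection is easy to sketch. Here's a minimal, illustrative scanner in Python - emphatically not a replacement for gitleaks or detect-secrets, which ship hundreds of patterns plus entropy-based detection; the patterns and names below are made up for the example:

```python
import re

# Illustrative patterns only - real scanners ship far more,
# plus entropy checks for generic high-randomness tokens.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{20,}['\"]"),
}

def find_secrets(text: str) -> list[tuple[str, int]]:
    """Return (pattern_name, line_number) for each suspected secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, lineno))
    return hits
```

Run over staged files in a pre-commit hook and fail the commit when the result is non-empty; the real tools do exactly this, just with much better pattern coverage.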

Process fix:

  • Require basic security scanning in plans - projects with dependencies should include dependency vulnerability scanning in CI (npm audit for Node.js, pip-audit for Python, govulncheck for Go, bundler-audit for Ruby). Updated the implementable skill to flag: (1) missing security scanning when dependencies exist, (2) unverified tooling (configured but doesn't execute), and (3) partial coverage (only frontend or backend scanned).

#9. Backend Maintainability: 86.14 (A Rating)

Measures: Maintainability Index of Python backend code using Radon. Combines cyclomatic complexity, lines of code, and Halstead volume (code vocabulary/difficulty) into a single 0-100 score. Higher scores indicate more maintainable code.

What the dashboard says: Maintainability Index of 86.14, A rating.

The reality: The 86.14 is another average. Hundreds of simple files pull the score up to an A rating, which suggests the code is maintainable. Look at individual files and the picture changes: apps/ai/services/validator.py scores much lower because of the CC=40 function.

The metric confirms what we already know: the AI services are the problem area.
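The averaging problem is visible in the formula itself. A sketch of the classic Maintainability Index computation that radon builds on (radon adds a comment-percentage bonus term on top; the Halstead volume and SLOC inputs below are illustrative, not measured from the codebase):

```python
import math

def maintainability_index(halstead_volume: float, cyclomatic: float, sloc: int) -> float:
    """Classic MI formula, rescaled to 0-100 as radon does."""
    mi = (171
          - 5.2 * math.log(halstead_volume)
          - 0.23 * cyclomatic
          - 16.2 * math.log(sloc))
    return min(max(0.0, mi * 100 / 171), 100.0)

# Illustrative inputs: a typical small module vs a validator-style module.
simple = maintainability_index(halstead_volume=200, cyclomatic=3, sloc=40)
complex_ = maintainability_index(halstead_volume=8000, cyclomatic=40, sloc=400)
```

Because complexity enters linearly but is averaged across every file, one CC=40 function barely moves a project-wide mean - which is why a per-file view is needed to see it at all.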

Why it happened: Same root cause as complexity - no code quality constraints in the plan.

What to improve:

Code fixes:

The maintainability issues are caused by the same high-complexity functions already covered in Backend Complexity. All three fixes from that section apply here:

  1. Replace manual validation with jsonschema - See Backend Complexity fix #1
  2. Extract JSON parsing logic - See Backend Complexity fix #2
  3. Refactor HTML parsing monolith - See Backend Complexity fix #3

Reducing cyclomatic complexity in these functions will directly improve their maintainability index scores.

Process fix:

  • Require code quality gates in CLAUDE.md - The root cause was "no code quality constraints in the plan". Including complexity limits (e.g., "max cyclomatic complexity per function: 15") would have caught the CC=40 function during implementation. The updated implementable skill now flags plans missing code quality gates (complexity limits, file size limits, coverage thresholds) in CLAUDE.md or project config files.

#10. Python Lint Issues: 0

Measures: Code quality and style issues in Python backend using Ruff (fast Python linter). Checks for errors, code style, security issues, unused imports, type hint modernisation, and Django-specific patterns.

What the dashboard says: Ruff reports zero linting issues.

The reality:

The pyproject.toml ignores 27 rules:

ignore = [
    "E501",   # line too long
    "B905",   # zip without strict=
    "SIM103", # return condition directly
    "SIM108", # use ternary operator
    "SIM102", # nested if statements
    "I001",   # import sorting
    # ... 21 more rules
]

Code quality: The zero-issues report doesn't reflect the actual code quality. The pyproject.toml disables 27 linting rules, so the linter simply doesn't report those issues.

Why it happened: The pyproject.toml was created in a single commit with all 27 rules disabled. Many include comments like "fix incrementally" or "review later", indicating an intention to enable them gradually. But there's no tracking document, no plan to re-enable rules, and no follow-up commits. The "gradual adoption" never happened.

What to improve:

Code fixes:

  1. Fix auto-fixable violations and enable the rules - Ruff can auto-fix many violations. Run ruff check --fix, commit the changes, and remove the rules from the ignore list. Auto-fixable rules: I001 (import sorting), UP006 (use list instead of List), UP007 (use X | Y for Union), C408 (unnecessary dict call).

  2. Enable remaining rules in CI only - Rather than disabling rules globally, enable them in CI but allow local development to continue. Configure GitHub Actions to run ruff check without the ignore list. PRs fail if they introduce violations. This prevents new violations while allowing gradual cleanup of existing ones.

  3. Use inline suppressions for legitimate exceptions - For cases where a rule genuinely doesn't apply, use inline # noqa comments with explanations rather than globally disabling the rule:

    from typing import TYPE_CHECKING
    if TYPE_CHECKING:
        from apps.recipes.models import Recipe  # noqa: F401 - needed for type hints

Process fix:

  • Prevent permanently relaxed linter standards - plans that disable linter rules "for gradual adoption" must include either: (1) tasks to fix violations and re-enable rules, (2) stricter CI config that prevents new violations, or (3) justification for why rules should stay disabled. Updated the implementable skill to flag globally disabled linter rules without a path to enforcement.

#11. Legacy Lint: 17 Warnings (C Rating)

Measures: Code quality in the legacy ES5 JavaScript frontend (for iOS 9 support) using ESLint. Checks for unused variables, undefined globals, and complexity issues in vanilla JavaScript without module support.

What the dashboard says: 17 ESLint warnings in legacy ES5 code.

The reality:

The ESLint configuration sets strict limits for ES5 code:

| Rule | Limit | Purpose |
|------|-------|---------|
| complexity | 15 | Max cyclomatic complexity per function |
| max-lines-per-function | 100 | Max lines per function |
| max-depth | 4 | Max nesting depth |
| no-unused-vars | warn | Catch dead code |
| no-shadow | warn | Catch variable shadowing |

ESLint quality rules for legacy ES5 JavaScript targeting iOS 9 compatibility.

The ESLint report shows 17 warnings distributed across files:

| File | Warnings | Issues |
|------|----------|--------|
| settings.js | 8 | Function too long, variable shadowing, unused vars |
| detail.js | 4 | Complexity exceeded, variable redeclaration |
| Other files | 5 | Various (ai-error.js, search.js, time-detect.js, timer.js) |

Distribution of 17 ESLint warnings across legacy JavaScript files.

Specific violations:

  • Complexity breach: handleError function has complexity 16 (exceeds limit of 15)
  • Lines breach: setupEventListeners has 159 lines (exceeds limit of 100)
  • Variable shadowing: Loop variable i redeclared across multiple for-loops, profileId shadowing outer scope
  • Unused variables: resetPreview assigned but never referenced
  • Variable redeclaration: no-redeclare violations in detail.js

The warnings come from the same architectural issues covered in Legacy Duplication: god components (settings.js at 1,100 lines, detail.js at 1,275 lines) that mix too many concerns.

Why it happened: The plan said "build legacy frontend for iOS 9 support" but didn't specify quality requirements. The ESLint config exists with good rules, but there was no requirement to fix warnings before merging.

What to improve:

Code fixes:

The lint warnings come from the same architectural issues detailed in Legacy Duplication. The same fixes from that section address the root causes:

  1. Extract settings.js tabs into separate modules - See Legacy Duplication fix #2. Breaking the 1,100-line file into 6 smaller modules would eliminate the functions exceeding 100 lines.

  2. Extract detail.js features into separate modules - The 1,275-line file has functions with complexity > 15 for serving adjustment and remix handling. Splitting into separate modules reduces complexity naturally.

  3. Replace event listener loops with delegation - See Legacy Duplication fix #3. This eliminates the 16+ loops that cause variable shadowing warnings.

Process fix:

  • Require fixing lint warnings before merging - The ESLint config has good rules (complexity ≤ 15, max 100 lines per function), but warnings were allowed to accumulate because there was no enforcement. Configure CI to fail on warnings or require "All ESLint warnings fixed" as a task completion criterion. The updated implementable skill now flags linter configurations that allow warnings to pass CI.

#12. Bundle Size: 1,807 KB (C Rating)

Measures: Total size of JavaScript and CSS files in the modern frontend production build (React/TypeScript). Includes application code, dependencies, and source maps generated by Vite bundler. Does not include the legacy ES5 frontend (187 KB of separate static files for iOS 9 support).

What the dashboard says: 1,807 KB total bundle size, C rating.

The reality:

The actual shipped bundle is smaller:

  • JS: 357 KB (101 KB gzipped)
  • CSS: 38 KB (7 KB gzipped)

The 1,807 KB figure includes source maps (1,413 KB) which aren't served to users.

Actual size: For a recipe app with AI features, timers, and offline capability, 357 KB of JavaScript is reasonable - and what users actually download is 101 KB gzipped. The C rating seems harsh because the dashboard grades the raw build output, source maps and all, rather than the transfer size users experience.

Why it happened: The dashboard reports raw build output. No one configured it to show what actually matters: gzipped transfer size.

What to improve:

Dashboard fix:

  1. Exclude source maps from bundle size calculation - Modify the CI bundle analysis script to skip .map files when calculating total size. Source maps aren't served to users, so including them (1,413 KB) makes the metric misleading. Change line 909 to: if (fs.existsSync(distDir)) { fs.readdirSync(distDir).filter(file => !file.endsWith('.map')).forEach(file => {

  2. Report gzipped transfer sizes - Add gzip calculation to the CI script to show what users actually download (101 KB JS + 7 KB CSS = 108 KB total) rather than uncompressed sizes (357 KB + 38 KB = 395 KB). Use Node's zlib module to compress each file and report both raw and gzipped sizes.

Optional code optimizations (not urgent):

The current bundle size is reasonable for the app's features. If you wanted to optimize further:

  1. Add route-based code splitting - Use React lazy imports to split screens into separate chunks. Users would only download the code for screens they visit. For example, Settings.tsx (1,314 lines) could be loaded on demand rather than in the initial bundle.

  2. Lazy load AI features - The AI integration code could be loaded only when users access AI features (remixing, tips, discover), reducing the initial bundle for users who don't use those features.

#13. Metrics Tooling & CI Setup

Measures: Code review of the CI pipeline configuration and metrics collection scripts that generate the dashboard data.

The setup: 15 CI jobs collecting coverage, complexity, duplication, security, and bundle size metrics. Custom Python scripts generate JSON reports and HTML dashboards deployed to GitHub Pages.

Code quality of the metrics tooling itself: The CI setup works but has significant code quality issues. The entire dashboard generation system (800+ lines of Python and JavaScript) is embedded in YAML heredocs, making it difficult to test or debug. Error handling throughout uses bare except: pass blocks that silently fail and default to zero when something goes wrong, which makes debugging nearly impossible.

The bundle size metric is misleading because it includes source maps that are never served to users. The actual bundle is around 357 KB of JavaScript, but the metric reports 1,807 KB total because it counts the 1,413 KB of .map files. The rating system shows 'C' when the actual served bundle would rate 'A'.

Several tools are configured inconsistently. The backend security job installs pip-audit at runtime rather than including it in requirements.txt, so it works in CI but not locally. ESLint runs on legacy JavaScript with || true, which means warnings are logged but don't fail the build, allowing code quality to degrade over time.

The rating logic (what counts as an 'A' or 'D') is duplicated 10+ times across different jobs. Changing a threshold requires updating multiple locations. The Python scripts for extracting metrics all use the same pattern but are copied rather than shared as a reusable function.

Why it happened: This is where I got lazy. On day 4 of the prototyping challenge, I asked Claude to add metrics collection without putting much thought into the plan. The goal was to collect data for the blog post, not to build production-quality tooling. I didn't review the implementation myself - I just asked Claude to make it work and moved on. With poor prompting and no clear guidance, the model apparently decided that embedding 800+ lines of Python and JavaScript in YAML heredocs was the path of least resistance. The CI workflows are currently the most wild-west part of the entire project.

When it came time to review the metrics tooling itself, I got lazy again and asked Claude to review it without reading the code myself. That's how I discovered the 800+ lines of embedded Python and the misleading bundle size calculation - from Claude's review, not from actually looking at what I'd built.

See the full metrics tooling review for detailed analysis of the CI architecture, data accuracy issues, and security concerns.

What to improve:

  1. Fix bundle size calculation (ci.yml:908-920) - Add if (file.endsWith('.map')) return; to skip source maps. The metric currently reports 1,807 KB when the served assets total 395 KB uncompressed (108 KB gzipped).
  2. Replace except: pass with proper error handling (throughout coverage.yml) - At minimum, log what went wrong instead of silently defaulting to 0.
  3. Move Python scripts to .github/scripts/ - Extract the 800+ lines of inline heredocs into separate files to break it up a bit, but not much more. I'm not doing a detailed code review of the CI scripts themselves - they're infrastructure plumbing that works well enough to generate the metrics used in this review.
  4. Add pip-audit to requirements.txt (ci.yml:815) - Developers can't run security checks locally because the tool isn't in the dependency file.
  5. Calculate gzipped bundle sizes - The uncompressed metric doesn't reflect what users download. The served assets are 395 KB uncompressed but only 108 KB over the wire gzipped.
  6. Extract rating thresholds into constants - The rating logic is duplicated 10+ times. Changes require updating multiple locations.
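Fix #6 can be a single table plus one function. A sketch with illustrative thresholds - the real cutoffs live duplicated across the CI jobs and would need to be copied into the table:

```python
# Hypothetical grade bands, highest grade first: (minimum_value, grade).
# The actual cutoffs are currently duplicated 10+ times across CI jobs.
COVERAGE_GRADES = [(80, "A"), (60, "B"), (40, "C"), (20, "D")]

def grade(value: float, bands: list[tuple[float, str]]) -> str:
    """Return the first grade whose minimum the value meets, else 'F'."""
    for minimum, letter in bands:
        if value >= minimum:
            return letter
    return "F"
```

With one table per metric, changing a threshold becomes a one-line edit instead of a hunt through multiple workflow jobs.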

Process fix:

  • Flag CI configs with embedded scripts - Plans should extract substantial scripts (>50-100 lines) to separate files for testability rather than embedding them in YAML heredocs. The updated implementable skill now detects embedded scripts in CI configs and suggests extracting them to .github/scripts/ or similar. Also suggests adding CI for projects with test suites but no automation.

#14. Summary

Overall Assessment

The prototype code works, and it isn't anywhere near the worst I've ever worked with. Everything I've found in this review is fixable, and the land-and-expand approach outlined in the first blog post has produced working code miles ahead of my early experiments with AI and an all-in-one plan. It still blows my mind how much of the prototype worked right off the bat.

Planning Drives Quality

Where the quality dropped was generally where my prompting and planning also dropped in quality. Where the plans said "build X" without specifying quality constraints, architectural approach, refactoring steps, or test requirements as completion criteria, I got pretty much what I asked for. Tests scheduled for a separate session became tests that mock at too high a level. The god component pattern appeared in both the React frontend (Settings.tsx at 1,314 lines) and the legacy ES5 code (settings.js at 1,100 lines) - same architectural problem regardless of technology, which points to planning rather than implementation.

The best example of planning failure is the metrics tooling itself. On day 4 I got lazy and asked Claude to add metrics collection without putting enough thought into the plan. I ended up with 800+ lines of Python and JavaScript embedded in YAML heredocs, bare except: pass blocks, and misleading calculations. The metrics were still useful but needed human interpretation - coverage percentages didn't show which code was covered, duplication scores didn't show whether code was in shared utilities or crammed into god components, and bundle size included source maps that are never served to users.

Process Improvements

I made seven updates to the implementable skill to hopefully catch these issues in future plans: missing refactoring steps, globally disabled linters, embedded CI scripts, missing code quality gates, missing test requirements, monolithic files, and linters that don't fail on warnings. I think it's also time to try something along the lines of spec-kit to see how it compares to the approach taken to rapidly build Cookie.

Takeaways

Poor planning equals poor implementation. AI doesn't change that. I trusted Claude would figure out what I meant instead of being specific about what I wanted. That was the mistake on day four when I asked for the CI setup. Be explicit in your plans.

If you're using AI to build software, you need CI and metrics. It's actually easier to add them with AI than it was before, so the barrier is lower. Even if the implementation is rough to start with (like my CI setup was), having the metrics lets you spot problems sooner.

I actually enjoyed doing this code review with Claude. But "do a code review" isn't enough on its own - you need to tell it to use the repo, dig through git history, use the gh CLI if it needs to. Once you have the first report, ask Claude to fact check its own findings and add links to the actual code. That second pass catches mistakes and gives you something you can verify yourself.

What's Next

I made a phased implementation plan based on everything found in this review. It's organized into 8 phases with 43 tasks covering backend refactoring, frontend architecture, legacy JavaScript cleanup, and metrics tooling improvements. The tasks are ordered by risk and priority, so the critical bugs get fixed first and the polish comes later.

Doing this code review took longer than I expected, even with Claude helping. Reading through 5,000+ lines of code across four different codebases, cross-referencing metrics, writing up findings in detail, and then organizing all the fixes into a coherent plan turned into a proper time sink. In future I'm going to stick to shorter posts about specific topics rather than trying to review an entire prototype in one go.