Metrics Tooling & CI Setup Review
Repository: https://github.com/matthewdeaves/cookie
Commit: cfaafa8
Files Reviewed:
- .github/workflows/ci.yml (977 lines)
- .github/workflows/coverage.yml (1014 lines)
This review was generated by Claude Opus 4.5 through direct analysis of the CI workflow files and metrics collection scripts. The review examines code quality, data accuracy, and maintainability of the metrics tooling infrastructure itself.
Overview
The Cookie repository uses two GitHub Actions workflows to collect code quality metrics and publish them to a dashboard on GitHub Pages. The CI workflow runs 15 separate jobs collecting metrics for coverage, complexity, duplication, security, bundle size, and linting across frontend React/TypeScript, backend Django/Python, and legacy ES5 JavaScript codebases. The coverage workflow then downloads all the artifacts, generates badges, creates a historical record, and publishes an interactive HTML dashboard with trend graphs.
The system collects 12 distinct metrics from various tools: Vitest for frontend coverage, pytest-cov for backend coverage, radon for Python complexity, ESLint for JavaScript complexity/linting, jscpd for duplication detection, npm audit and pip-audit for security scanning, and Vite for bundle analysis.
CI Pipeline Architecture
Job Organization
The CI workflow defines 15 jobs split across three categories:
Linting (3 jobs):
- frontend-lint: ESLint on React/TypeScript frontend
- backend-lint: ruff check on Django backend
- legacy-lint: ESLint on ES5 legacy JavaScript
Testing (3 jobs):
- frontend-typecheck: TypeScript compiler check
- frontend-test: Vitest with coverage
- backend-test: pytest with coverage
Metrics Collection (8 jobs):
- backend-complexity: radon cyclomatic complexity and maintainability index
- frontend-complexity: ESLint complexity warnings
- frontend-duplication: jscpd on TypeScript/React
- backend-duplication: jscpd on Python
- legacy-duplication: jscpd on ES5 JavaScript
- frontend-security: npm audit
- backend-security: pip-audit
- frontend-bundle: Vite build size analysis
Validation (1 job):
ci-success: Checks all jobs passed
Publishing (1 workflow):
coverage.yml: Runs on CI completion, downloads artifacts, generates dashboard
Parallelisation
All metric collection jobs run in parallel with no dependencies between them, which is efficient. The ci-success job uses needs: to wait for 11 jobs to complete before validating their results. The coverage workflow runs separately via workflow_run trigger after CI completes successfully.
Each job runs independently and uploads its results as artifacts with 30-day retention. The separate coverage workflow then downloads all artifacts and processes them together.
Good Patterns
The pipeline has several well-designed aspects:
- Separation of concerns: Metric collection runs in CI, dashboard generation runs in a separate workflow. This prevents CI failures if dashboard generation breaks.
- Parallel execution: All metric jobs run simultaneously rather than sequentially, making efficient use of GitHub Actions minutes.
- Artifact-based communication: Jobs communicate through artifacts rather than job outputs, which is more robust for large data.
- Path filtering: The workflow skips runs for changes limited to markdown, docs, and plans, avoiding unnecessary builds.
- Concurrency control: The `concurrency` section cancels in-progress runs when new commits arrive.
- Defensive programming: Most metric jobs use `|| true` to prevent metrics failures from blocking builds.
Issues Found
Several problems exist with the job structure:
ESLint warnings don't fail the build: The frontend linter (line 48) runs npm run lint which will fail on errors. However, the legacy lint job runs ESLint with || true three times (lines 69, 77, 83), which means warnings and errors are logged but don't fail the build. This allows lint issues to accumulate over time.
Inconsistent failure handling: The ci-success job checks that core jobs passed (lines 950-957) and frontend/legacy quality checks passed (lines 958-969), but treats backend complexity and duplication as informational (lines 970-976). The rationale for this inconsistency is unclear.
No dependency caching for Node tools: The legacy lint job and duplication jobs run jscpd via npx, downloading it fresh each time. While this ensures the latest version, it adds time to each run. The frontend jobs properly use the npm cache via cache: 'npm' (line 41), but the legacy and backend jobs that use Node.js don't.
Artifact download failures are silent: The coverage workflow uses continue-on-error: true for all artifact downloads (lines 29, 38, 47, etc.). If an artifact is missing, the dashboard generation continues but displays incorrect or stale data without warning the user.
Metrics Collection Scripts
Coverage Collection
Frontend: The frontend-test job runs Vitest with npm run test:coverage, which generates coverage in multiple formats. The configuration is clean and straightforward.
Backend: The backend-test job runs pytest with explicit coverage configuration using --cov=apps --cov=cookie to measure both directories. It generates XML, HTML, JSON, and terminal reports. The configuration is explicit and correct.
The coverage extraction in the dashboard (coverage.yml lines 256-270) reads coverage-summary.json for frontend and coverage.json for backend. The error handling uses try/except: pass which silently fails if files are missing, defaulting to 0% coverage.
Complexity Analysis
Backend: The backend-complexity job runs four radon commands generating cyclomatic complexity, maintainability index, raw metrics, and Halstead metrics. The Python script (lines 332-493) calculates average complexity across all functions and average maintainability across all files.
The rating system is defined as:
- CC rating: A if ≤5, B if ≤10, C if ≤20, else D
- MI rating: A if ≥80, B if ≥60, C if ≥40, else D
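These thresholds can be expressed as two small rating functions (a sketch; the function names are illustrative, the cut-offs are the ones above):

```python
def cc_rating(avg_cc: float) -> str:
    """Rate average cyclomatic complexity: lower is better."""
    if avg_cc <= 5:
        return 'A'
    if avg_cc <= 10:
        return 'B'
    if avg_cc <= 20:
        return 'C'
    return 'D'

def mi_rating(avg_mi: float) -> str:
    """Rate average maintainability index: higher is better."""
    if avg_mi >= 80:
        return 'A'
    if avg_mi >= 60:
        return 'B'
    if avg_mi >= 40:
        return 'C'
    return 'D'
```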
Frontend: The frontend-complexity job runs ESLint with complexity rules inline. It counts complexity warnings but doesn't calculate actual complexity values.
The rating is based solely on warning count: A if 0 warnings, B if ≤5, C if ≤10, else D. This is a weak metric compared to the backend's actual cyclomatic complexity measurement.
Duplication Detection
All three duplication jobs use jscpd with similar patterns but different thresholds:
- Frontend (lines 596-604): 50 tokens, 5 lines minimum
- Backend (lines 700-709): 50 tokens, 5 lines minimum
- Legacy (lines 649-656): 30 tokens, 4 lines minimum (more sensitive)
The summary extraction (lines 607-628, lines 712-733, lines 659-680) reads the jscpd JSON report and extracts the duplication percentage. The error handling creates an empty report structure if parsing fails.
Rating thresholds vary by codebase:
- Frontend/Backend: A if ≤3%, B if ≤5%, C if ≤10%, else D
- Legacy: A if ≤5%, B if ≤10%, C if ≤15%, else D (more lenient)
Bundle Analysis
The frontend-bundle job builds the frontend and analyses the dist/assets directory. The Node.js script (lines 898-932) walks the directory and sums file sizes.
Critical issue: The script includes ALL files in the directory, including source maps:
```javascript
fs.readdirSync(distDir).forEach(file => {
  const filePath = path.join(distDir, file);
  const stats = fs.statSync(filePath);
  const sizeKB = Math.round(stats.size / 1024 * 100) / 100;
  totalSize += stats.size;
  if (file.endsWith('.js')) jsSize += stats.size;
  if (file.endsWith('.css')) cssSize += stats.size;
});
```

This code counts .js and .css files separately but adds ALL files to totalSize, including .map files which aren't served to users. The actual bundle served to users would be significantly smaller.
The rating thresholds are also misleading because they include source maps: A if <500KB, B if <1000KB, C if <2000KB, else D. A 1807KB bundle might rate as 'C' when the actual served bundle (excluding source maps) would be under 500KB and rate as 'A'.
No gzipped sizes: The bundle is reported as uncompressed bytes. Modern web servers serve JavaScript and CSS with gzip or brotli compression, typically reducing size by 70-80%. A 1807KB uncompressed bundle might transfer as only 200-300KB. The metric doesn't reflect this.
Security Scanning
Frontend (lines 761-790): Runs npm audit and extracts vulnerability counts from the JSON output. The rating is: D if any critical, C if any high, B if any moderate, else A. This is sensible.
Backend (lines 817-864): Installs pip-audit via pip install pip-audit (line 815) rather than including it in requirements.txt. This means pip-audit works in CI but not locally, making it harder to test security fixes during development.
The pip-audit parser (lines 828-864) handles two different JSON formats, which suggests the output format changed between versions. The rating is: A if 0 vulnerabilities, B if ≤2, C if ≤5, else D. Unlike npm audit, it doesn't use severity levels because pip-audit doesn't report them in the JSON format being parsed.
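For illustration, a parser that tolerates both shapes might look like this (a sketch; the exact JSON pip-audit emits has varied between versions, so the `dependencies` and `vulns` keys are assumptions to verify against the version in use):

```python
import json

def count_vulnerabilities(raw: str) -> int:
    """Count vulnerabilities across both assumed pip-audit JSON shapes:
    an object with a 'dependencies' key, or a bare list of dependencies."""
    data = json.loads(raw)
    deps = data.get('dependencies', []) if isinstance(data, dict) else data
    return sum(len(d.get('vulns', [])) for d in deps)
```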
Dashboard Generation Code
The dashboard generation happens entirely in coverage.yml via embedded Python and Bash scripts.
Code Quality Issues
Inline Python in YAML heredocs: The entire dashboard generation consists of Python code embedded in YAML heredocs (lines 200-219, lines 222-487, lines 490-559). This makes the code difficult to test, debug, and maintain. The scripts can't be run locally without copying them out of the YAML file, and there's no syntax highlighting or IDE support.
Error handling swallows all exceptions: Nearly every data extraction uses this pattern:
```python
try:
    with open('coverage/frontend/coverage-summary.json') as f:
        d = json.load(f)
    frontend_cov = d['total']['lines']['pct']
except: pass
```

The bare except: pass catches ALL exceptions including syntax errors, file not found, JSON parse errors, and KeyError. If the file format changes or the artifact is missing, the code silently defaults to 0 without logging what went wrong. This makes debugging extremely difficult.
No data validation: The code assumes all numeric values are valid. For example:

```python
avg_cc = round(total_cc / cc_count, 2) if cc_count > 0 else 0
```

This guards against division by zero but doesn't validate that total_cc or cc_count are numbers. If radon outputs unexpected data, the code will crash with unhelpful errors.
Inconsistent rating logic: Rating calculations are duplicated across three places:
- In the individual metric jobs (ci.yml)
- In the badge generation script (coverage.yml lines 244-245)
- Hardcoded in HTML/CSS (coverage.yml lines 387-390)
If you want to change what constitutes an 'A' rating, you need to update it in multiple places. The rating thresholds should be defined once as constants.
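Defining them once might look like this (a sketch; the table structure and `rate` helper are illustrative, the threshold values are taken from the workflows):

```python
# Threshold tables: (limit, rating) pairs checked in order.
# lower_is_better metrics compare with <=, others with >=.
THRESHOLDS = {
    'duplication':     {'limits': [(3, 'A'), (5, 'B'), (10, 'C')], 'lower_is_better': True},
    'complexity':      {'limits': [(5, 'A'), (10, 'B'), (20, 'C')], 'lower_is_better': True},
    'maintainability': {'limits': [(80, 'A'), (60, 'B'), (40, 'C')], 'lower_is_better': False},
}

def rate(metric: str, value: float) -> str:
    """Return the letter rating for a metric; 'D' if no limit matches."""
    spec = THRESHOLDS[metric]
    for limit, rating in spec['limits']:
        if (value <= limit) if spec['lower_is_better'] else (value >= limit):
            return rating
    return 'D'
```

Both CI jobs and the dashboard could then import the same table, so a threshold change happens in exactly one place.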
HTML injection without escaping: The back link injection (lines 200-219) uses string replacement to insert HTML into report files:

```python
back_link = '<div style="position:fixed;top:10px;right:10px;z-index:9999;">...'
content = content.replace('</body>', back_link + '</body>')
```

This works because the HTML is hardcoded and trusted. However, it's fragile because it assumes all HTML files have a </body> tag, and it processes files with errors='ignore' which could corrupt non-UTF-8 files.
Badge generation uses external service: The script downloads badges from shields.io (lines 247-254). If shields.io is down or rate-limits the requests, badge generation fails. The code uses try/except to print errors but doesn't fail the workflow, so you might deploy a dashboard with missing badges.
HTML Generation
The dashboard HTML is generated as a heredoc (lines 562-988). The quality is mixed:
Good aspects:
- Clean, modern CSS using system fonts
- Responsive grid layout using CSS Grid
- Accessible colour contrast (though not tested programmatically)
- Semantic HTML structure
- Chart.js for trend visualisation
- API endpoint for programmatic access
Issues:
- The HTML is 426 lines of code embedded in YAML, making it difficult to edit
- No HTML validation or linting
- No accessibility testing (missing ARIA labels, no keyboard navigation testing)
- Metrics are loaded via JavaScript, so they don't work without JS
- No error states shown to users if the API fetch fails
- Chart library loaded from CDN without Subresource Integrity hash
JavaScript injection risks: The metrics loading code (lines 957-984) reads JSON from the API and directly inserts values into the DOM:

```javascript
document.getElementById('bundle-size').textContent = data.bundle?.size_kb ?? '-';
```

This is safe because it uses textContent rather than innerHTML, but the pattern is fragile. If someone later changes this to innerHTML for formatting, it could introduce XSS vulnerabilities.
Historical Data Management
The history tracking (lines 489-559) fetches the existing history/all.json from the gh-pages branch, appends the current metrics, and keeps the last 90 entries.
Issues:
- The git fetch (lines 130-136) uses `2>/dev/null` to hide errors, making debugging difficult
- Multiple entries on the same date are replaced (line 545), but there's no timestamp, so you lose intra-day history
- No validation that dates are actually dates or in the correct format
- The 90-day limit is hardcoded (line 552) rather than configurable
Data Accuracy Issues
1. Bundle size includes source maps
The bundle analysis (ci.yml lines 908-920) sums all files in dist/assets/:
```javascript
if (fs.existsSync(distDir)) {
  fs.readdirSync(distDir).forEach(file => {
    const filePath = path.join(distDir, file);
    const stats = fs.statSync(filePath);
    totalSize += stats.size;
    // ...
  });
}
```

Vite generates source maps with the .map extension by default. These files are typically as large as or larger than the code itself, but they're never served to users in production. Including them in bundle size metrics is misleading.

For example, if the actual JavaScript is 200KB but the source map is 800KB, the metric reports 1000KB when users only download 200KB.

Fix: Filter out .map files:

```javascript
if (file.endsWith('.map')) return; // Skip source maps
```

2. No gzipped transfer sizes
The bundle metric reports uncompressed bytes. Web servers serve JavaScript and CSS with gzip or brotli compression, which typically reduces size by 70-80% for text files. A 1000KB uncompressed bundle might transfer as only 200KB.
The metric doesn't reflect what users actually download, making it hard to judge real-world performance impact.
Fix: Use a tool like gzip-size or calculate compressed sizes:

```javascript
const zlib = require('zlib');
const gzipSize = zlib.gzipSync(content).length;
```

3. No validation of tool outputs
All metric extraction uses try/except: pass which assumes tools always output valid data. If radon crashes or pytest generates a malformed JSON file, the code silently defaults to 0 without warning.
Fix: Log errors and fail the workflow if data is invalid:
```python
try:
    with open('coverage/backend/coverage.json') as f:
        d = json.load(f)
    backend_cov = round(d['totals']['percent_covered'], 1)
except FileNotFoundError:
    print("ERROR: Backend coverage file not found")
    backend_cov = 0
except (KeyError, json.JSONDecodeError) as e:
    print(f"ERROR: Failed to parse backend coverage: {e}")
    backend_cov = 0
```

4. Frontend complexity uses warning count, not actual complexity
The frontend complexity metric (ci.yml lines 521-567) runs ESLint with --rule '{"complexity": ["warn", 10]}' and counts how many functions trigger warnings. It doesn't calculate average complexity or identify the worst functions.
The backend complexity uses radon to calculate average cyclomatic complexity (a number like 3.5) and maintainability index. The frontend just counts warnings, making the two metrics incomparable.
Why this is misleading: A codebase with 100 functions at complexity 9 gets a perfect score (0 warnings), while a codebase with 1 function at complexity 11 gets 1 warning. The first codebase is much more complex overall but looks better.
5. History file uses date as key, losing intra-day data
The history tracking (coverage.yml line 545) removes existing entries for the same date:
```python
history['entries'] = [e for e in history['entries'] if e.get('date') != date]
history['entries'].append(entry)
```

If you run CI twice in one day, the second run overwrites the first. You can't track changes within a day, which makes debugging difficult if metrics suddenly change.
Fix: Include timestamp or build number in the key.
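A minimal sketch of the timestamped approach, assuming the same history/entries structure (the `MAX_ENTRIES` constant also replaces the hardcoded 90-entry limit):

```python
from datetime import datetime, timezone

MAX_ENTRIES = 90  # configurable instead of hardcoded in the script body

def append_entry(history: dict, entry: dict) -> dict:
    """Append a timestamped entry; keep intra-day runs and cap total length."""
    entry['timestamp'] = datetime.now(timezone.utc).isoformat()
    history.setdefault('entries', []).append(entry)
    history['entries'] = history['entries'][-MAX_ENTRIES:]
    return history
```

Because entries are no longer keyed by date, two runs on the same day both survive, and the timestamp shows which commit produced which numbers.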
Maintainability Concerns
Inline scripts vs separate files
The entire dashboard generation logic lives in YAML heredocs. This creates several problems:
- No local testing: You can't run the scripts locally without copying them out of the YAML
- No syntax checking: IDEs don't provide syntax highlighting or error detection for heredocs
- No version control clarity: Git diffs show changes as YAML changes, making it hard to review code logic
- No code reuse: The same rating logic is duplicated across jobs because scripts can't import shared code
- Difficult debugging: When something breaks, you need to add debugging to the YAML, push, wait for CI to run, then check logs
Better approach: Move Python scripts to .github/scripts/generate-dashboard.py and call them from the workflow. This enables local testing, proper syntax checking, and easier debugging.
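As an illustration, the backend coverage extraction could become a small standalone script (a sketch; the path and JSON keys follow the snippet quoted earlier, the filename and function name are hypothetical):

```python
#!/usr/bin/env python3
"""Extract backend coverage; runnable locally or called from the workflow."""
import json
import sys

def read_backend_coverage(path: str = 'coverage/backend/coverage.json') -> float:
    """Return the backend line-coverage percentage, logging failures loudly."""
    try:
        with open(path) as f:
            data = json.load(f)
        return round(data['totals']['percent_covered'], 1)
    except FileNotFoundError:
        print(f"ERROR: {path} not found", file=sys.stderr)
        return 0.0
    except (KeyError, json.JSONDecodeError) as e:
        print(f"ERROR: failed to parse {path}: {e}", file=sys.stderr)
        return 0.0

if __name__ == '__main__':
    print(read_backend_coverage())
```

Because the logic lives in a function, it can be unit-tested with mock JSON files and run locally against real artifacts before pushing.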
Duplication across jobs
The pattern for generating summary JSON is repeated 10 times across different jobs. Each time it reads a JSON file, extracts metrics, calculates a rating, and writes summary.json. The rating logic is duplicated, meaning changes require updating multiple places.
For example, the duplication rating appears in:
- ci.yml line 623: frontend duplication job
- ci.yml line 675: legacy duplication job
- ci.yml line 728: backend duplication job
If you want to change the 'A' threshold from 3% to 5%, you need to update it three times.
Documentation quality
The workflows have minimal inline documentation. There are no comments explaining:
- Why certain rating thresholds were chosen
- Why backend complexity is informational but frontend complexity must pass
- What the artifact retention policy is and why it's 30 days
- How to add a new metric
- How to debug when a metric is wrong
The README doesn't explain how to run the metrics locally or how the dashboard generation works.
Debugging difficulty
When a metric is wrong, debugging requires:
- Finding which job generated it
- Looking at CI logs to find the Python/Node output
- Guessing which try/except block failed silently
- Adding print statements to the YAML heredoc
- Pushing a commit to trigger CI
- Waiting for CI to run
- Checking logs again
There's no way to run the dashboard generation locally with test data, and no unit tests for the metric extraction logic.
Security Concerns
pip-audit not in requirements.txt
The backend security job installs pip-audit at runtime (ci.yml line 815):
```yaml
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt
    pip install pip-audit
```

The tool works in CI but isn't available for local development because it's not in requirements.txt. Developers can't run security checks locally before pushing, and they might install packages with vulnerabilities without knowing.
Impact: Low severity but annoying. Developers working on dependency updates can't verify they've fixed vulnerabilities locally.
Fix: Add pip-audit to requirements.txt or create a separate requirements-dev.txt for development tools.
Badge generation depends on external service
The dashboard generation downloads badges from shields.io (coverage.yml lines 247-254):
```python
url = f'https://img.shields.io/badge/{label}-{value}-{color}'
urllib.request.urlretrieve(url, f'site/coverage/badges/{name}.svg')
```

If shields.io is down, rate-limits requests, or changes its API, badge generation fails. The error handling prints a message but doesn't fail the workflow, so you might deploy a dashboard with missing badges.
Impact: Low severity. Missing badges don't break functionality, just make the dashboard look incomplete.
Fix: Generate SVG badges locally using a template rather than downloading them.
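A local badge generator needs only a template string (a sketch; the SVG layout is deliberately simpler than shields.io's output, and the width estimates are rough):

```python
BADGE_TEMPLATE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="20">'
    '<rect width="{lw}" height="20" fill="#555"/>'
    '<rect x="{lw}" width="{vw}" height="20" fill="{color}"/>'
    '<text x="{lx}" y="14" fill="#fff" font-family="Verdana" font-size="11" '
    'text-anchor="middle">{label}</text>'
    '<text x="{vx}" y="14" fill="#fff" font-family="Verdana" font-size="11" '
    'text-anchor="middle">{value}</text>'
    '</svg>'
)

def make_badge(label: str, value: str, color: str = '#4c1') -> str:
    """Render a simple two-panel SVG badge; widths are crude per-character estimates."""
    lw = 10 + 7 * len(label)   # left (label) panel width
    vw = 10 + 7 * len(value)   # right (value) panel width
    return BADGE_TEMPLATE.format(
        w=lw + vw, lw=lw, vw=vw, color=color,
        lx=lw // 2, vx=lw + vw // 2, label=label, value=value,
    )
```

Writing `make_badge('coverage', '87%')` to `site/coverage/badges/coverage.svg` removes the network dependency entirely.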
Chart.js loaded from CDN without SRI
The dashboard loads Chart.js from a CDN (coverage.yml line 870):
```html
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
```

There's no Subresource Integrity (SRI) hash, so if the CDN is compromised or serves unexpected content, the dashboard could load malicious JavaScript.
Impact: Low severity for a metrics dashboard, but still a security best practice to add SRI.
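Computing the integrity value is straightforward with the standard library (a sketch; run it over the exact bytes of the file the CDN serves):

```python
import base64
import hashlib

def sri_hash(content: bytes) -> str:
    """Compute a Subresource Integrity value (sha384) for a script's bytes."""
    digest = hashlib.sha384(content).digest()
    return 'sha384-' + base64.b64encode(digest).decode('ascii')
```

Download chart.umd.min.js once, run `sri_hash` over its bytes, and paste the result into the script tag's integrity attribute.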
Fix: Add the integrity attribute:

```html
<script src="..." integrity="sha384-..." crossorigin="anonymous"></script>
```

No input validation on artifact data
The dashboard generation assumes all data in artifacts is trustworthy. It doesn't validate that percentages are between 0-100, that file counts are non-negative, or that dates are actually dates.
If a malicious actor could control the artifacts (through a compromised CI job or dependency), they could inject arbitrary data into the dashboard. The dashboard displays this data without sanitisation, though the use of textContent instead of innerHTML prevents XSS.
Impact: Very low severity. Artifacts are only created by CI jobs in the same repository, and they'd need write access to compromise those jobs.
Recommendations
Critical
1. Fix bundle size calculation to exclude source maps (ci.yml lines 908-920)
   - Add a check to skip files ending in `.map`
   - This is reporting 1413KB when the actual served bundle is likely under 500KB
   - Current rating is misleading and could hide real bundle bloat
2. Add pip-audit to requirements.txt or requirements-dev.txt (ci.yml line 815)
   - Developers can't run security checks locally
   - Makes it harder to fix vulnerabilities during development
3. Replace bare `except: pass` with proper error handling (throughout coverage.yml)
   - Current code silently fails and defaults to 0
   - Makes debugging extremely difficult when metrics are wrong
   - At minimum, log what went wrong
High Priority
1. Move Python scripts to separate files in `.github/scripts/`
   - Current inline heredocs are difficult to test and debug
   - No syntax highlighting, no local execution, no code reuse
   - Makes maintenance and extension much harder
2. Calculate gzipped bundle sizes, not just raw sizes (ci.yml lines 892-932)
   - Current metric doesn't reflect what users download
   - A 1807KB bundle might compress to 200KB but looks huge in metrics
3. Make ESLint failures actually fail the build (ci.yml lines 69, 77, 83)
   - The `|| true` pattern allows warnings to accumulate
   - Code quality degrades over time with no enforcement
4. Extract rating thresholds into constants
   - Rating logic is duplicated 10+ times across workflows
   - Changes require updates in multiple places
   - Risk of inconsistent ratings
5. Fail dashboard deployment if critical data is missing
   - Current `continue-on-error: true` silently deploys incomplete dashboards
   - At minimum, require coverage and bundle artifacts
Medium Priority
1. Add documentation explaining the CI setup
   - How to add a new metric
   - Why certain thresholds were chosen
   - How to debug when metrics are wrong
   - How to run metrics locally
2. Add SRI hashes to CDN resources (coverage.yml line 870)
   - Security best practice for loading external scripts
   - Low risk but easy to fix
3. Add timestamps to history entries, not just dates (coverage.yml line 514)
   - Current system loses intra-day history
   - Makes debugging difficult if metrics change during the day
4. Generate SVG badges locally instead of downloading from shields.io
   - Removes external dependency
   - Faster and more reliable
5. Validate artifact data before using it
   - Check that percentages are 0-100
   - Check that counts are non-negative
   - Check that required fields exist
   - Log warnings for unexpected values
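The validation recommendation can be sketched as a small pre-deployment gate (the metric field names here are hypothetical):

```python
def validate_metrics(metrics: dict) -> list:
    """Return a list of warnings for values outside expected ranges."""
    warnings = []
    # Percentages must be numeric and within 0-100.
    for key in ('frontend_coverage', 'backend_coverage', 'duplication_pct'):
        value = metrics.get(key)
        if value is None:
            warnings.append(f"{key}: missing")
        elif not isinstance(value, (int, float)) or not 0 <= value <= 100:
            warnings.append(f"{key}: out of range ({value!r})")
    # Counts must be non-negative.
    for key in ('vuln_count', 'file_count'):
        value = metrics.get(key)
        if isinstance(value, (int, float)) and value < 0:
            warnings.append(f"{key}: negative ({value!r})")
    return warnings
```

The dashboard workflow could log these warnings, and fail the deployment if any critical field is missing rather than silently publishing zeros.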
Low Priority
1. Add caching for Node.js in backend duplication jobs
   - Currently installs jscpd fresh each time
   - Would speed up CI runs slightly
2. Unify frontend and backend complexity metrics
   - Frontend uses warning count, backend uses actual cyclomatic complexity
   - Makes the two metrics incomparable
   - Consider using a real complexity analyser for TypeScript
3. Add unit tests for metric calculation logic
   - Extract calculation code to functions
   - Test with mock data
   - Catch regressions when changing rating thresholds
4. Add accessibility testing to the generated HTML
   - Check colour contrast ratios
   - Add ARIA labels
   - Test keyboard navigation
   - Validate HTML with W3C validator
Summary
The metrics tooling collects a comprehensive set of quality indicators across three different codebases using appropriate tools for each. The pipeline is well-architected with parallel execution, artifact-based communication, and separation between collection and reporting. The dashboard itself is functional and provides good visualisations with historical trends.
However, the implementation has significant code quality issues. The entire dashboard generation system is embedded in YAML heredocs totalling over 800 lines of inline Python and JavaScript. Error handling is essentially absent, with bare except: pass blocks throughout that silently fail and default to zero. This makes debugging extremely difficult when metrics are wrong.
Several metrics report misleading data. The bundle size includes source maps that are never served to users, making the actual bundle appear 5-7x larger than it is. The bundle size also reports uncompressed bytes rather than gzipped transfer sizes, which is what actually impacts performance. The frontend complexity metric counts warnings rather than measuring actual complexity, making it incomparable to the backend metric.
The rating logic is duplicated across 10+ locations in the code, making it easy to introduce inconsistencies when changing thresholds. The tool configuration is inconsistent, with pip-audit working in CI but not locally because it's not in requirements.txt.
Grade: C - Functional but needs significant refactoring
The system works and provides value, but the code quality makes it fragile and difficult to maintain. The misleading bundle size metric is particularly problematic because it makes the frontend look much worse than it actually is. The inline scripts and poor error handling create significant technical debt that will make future improvements difficult.
Key Issues:
- Bundle size includes source maps, reporting 1413KB instead of the actual ~200KB served to users
- Over 800 lines of Python/JavaScript embedded in YAML heredocs with no testing or proper error handling
- Bare `except: pass` throughout makes debugging nearly impossible when metrics are wrong
- pip-audit works in CI but not locally, making it hard to fix vulnerabilities during development
- Rating logic duplicated 10+ times across the codebase
Strengths:
- Comprehensive metric collection covering 12 different quality indicators
- Well-designed CI pipeline with parallel execution and artifact-based communication
- Historical tracking with trend graphs provides good visibility into quality changes over time