Metrics Tooling & CI Setup Review
Repository: https://github.com/matthewdeaves/cookie
Commit: cfaafa8
Files Reviewed:
- .github/workflows/ci.yml (977 lines)
- .github/workflows/coverage.yml (1014 lines)
This review was generated by Claude Opus 4.5 through direct analysis of the CI workflow files and metrics collection scripts. The review examines code quality, data accuracy, and maintainability of the metrics tooling infrastructure itself.
Overview
The Cookie repository uses two GitHub Actions workflows to collect code quality metrics and publish them to a dashboard on GitHub Pages. The CI workflow runs 15 separate jobs collecting metrics for coverage, complexity, duplication, security, bundle size, and linting across frontend React/TypeScript, backend Django/Python, and legacy ES5 JavaScript codebases. The coverage workflow then downloads all the artifacts, generates badges, creates a historical record, and publishes an interactive HTML dashboard with trend graphs.
The system collects 12 distinct metrics from various tools: Vitest for frontend coverage, pytest-cov for backend coverage, radon for Python complexity, ESLint for JavaScript complexity/linting, jscpd for duplication detection, npm audit and pip-audit for security scanning, and Vite for bundle analysis.
CI Pipeline Architecture
Job Organization
The CI workflow defines 15 jobs split across three categories:
Linting (3 jobs):
- frontend-lint: ESLint on React/TypeScript frontend
- backend-lint: ruff check on Django backend
- legacy-lint: ESLint on ES5 legacy JavaScript
Testing (3 jobs):
- frontend-typecheck: TypeScript compiler check
- frontend-test: Vitest with coverage
- backend-test: pytest with coverage
Metrics Collection (8 jobs):
- backend-complexity: radon cyclomatic complexity and maintainability index
- frontend-complexity: ESLint complexity warnings
- frontend-duplication: jscpd on TypeScript/React
- backend-duplication: jscpd on Python
- legacy-duplication: jscpd on ES5 JavaScript
- frontend-security: npm audit
- backend-security: pip-audit
- frontend-bundle: Vite build size analysis
Validation (1 job):
ci-success: Checks all jobs passed
Publishing (1 workflow):
coverage.yml: Runs on CI completion, downloads artifacts, generates dashboard
Parallelisation
All metric collection jobs run in parallel with no dependencies between them, which is efficient. The ci-success job uses needs: to wait for 11 jobs to complete before validating their results. The coverage workflow runs separately via workflow_run trigger after CI completes successfully.
Each job runs independently and uploads its results as artifacts with 30-day retention. The separate coverage workflow then downloads all artifacts and processes them together.
Good Patterns
The pipeline has several well-designed aspects:
- Separation of concerns: Metric collection runs in CI, dashboard generation runs in a separate workflow. This prevents CI failures if dashboard generation breaks.
- Parallel execution: All metric jobs run simultaneously rather than sequentially, making efficient use of GitHub Actions minutes.
- Artifact-based communication: Jobs communicate through artifacts rather than job outputs, which is more robust for large data.
- Path filtering: The workflow skips runs for changes limited to markdown, docs, and plans, avoiding unnecessary builds.
- Concurrency control: The `concurrency` section cancels in-progress runs when new commits arrive.
- Defensive programming: Most metric jobs use `|| true` to prevent metrics failures from blocking builds.
Issues Found
Several problems exist with the job structure:
ESLint warnings don't fail the build: The frontend linter (line 48) runs npm run lint which will fail on errors. However, the legacy lint job runs ESLint with || true three times (lines 69, 77, 83), which means warnings and errors are logged but don't fail the build. This allows lint issues to accumulate over time.
Inconsistent failure handling: The ci-success job checks that core jobs passed (lines 950-957) and frontend/legacy quality checks passed (lines 958-969), but treats backend complexity and duplication as informational (lines 970-976). The rationale for this inconsistency is unclear.
No dependency caching for Node tools: The legacy lint job and duplication jobs run jscpd via npx, downloading it fresh each time. While this ensures the latest version, it adds time to each run. The frontend jobs properly use the npm cache via cache: 'npm' (line 41), but the legacy and backend jobs that use Node.js don't.
Artifact download failures are silent: The coverage workflow uses continue-on-error: true for all artifact downloads (lines 29, 38, 47, etc.). If an artifact is missing, the dashboard generation continues but displays incorrect or stale data without warning the user.
Metrics Collection Scripts
Coverage Collection
Frontend: The frontend-test job runs Vitest with npm run test:coverage, which generates coverage in multiple formats. The configuration is clean and straightforward.
Backend: The backend-test job runs pytest with explicit coverage configuration using --cov=apps --cov=cookie to measure both directories. It generates XML, HTML, JSON, and terminal reports. The configuration is explicit and correct.
The coverage extraction in the dashboard (coverage.yml lines 256-270) reads coverage-summary.json for frontend and coverage.json for backend. The error handling uses try/except: pass which silently fails if files are missing, defaulting to 0% coverage.
Complexity Analysis
Backend: The backend-complexity job runs four radon commands generating cyclomatic complexity, maintainability index, raw metrics, and Halstead metrics. The Python script (lines 332-493) calculates average complexity across all functions and average maintainability across all files.
The rating system is defined as:
- CC rating: A if ≤5, B if ≤10, C if ≤20, else D
- MI rating: A if ≥80, B if ≥60, C if ≥40, else D
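These thresholds can be expressed as two small rating functions (a sketch; the function names are illustrative, the cut-offs are the ones above):

```python
def cc_rating(avg_cc: float) -> str:
    """Rate average cyclomatic complexity: lower is better."""
    if avg_cc <= 5:
        return 'A'
    if avg_cc <= 10:
        return 'B'
    if avg_cc <= 20:
        return 'C'
    return 'D'

def mi_rating(avg_mi: float) -> str:
    """Rate average maintainability index: higher is better."""
    if avg_mi >= 80:
        return 'A'
    if avg_mi >= 60:
        return 'B'
    if avg_mi >= 40:
        return 'C'
    return 'D'
```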
Frontend: The frontend-complexity job runs ESLint with complexity rules inline. It counts complexity warnings but doesn't calculate actual complexity values.
The rating is based solely on warning count: A if 0 warnings, B if ≤5, C if ≤10, else D. This is a weak metric compared to the backend's actual cyclomatic complexity measurement.
Duplication Detection
All three duplication jobs use jscpd with similar patterns but different thresholds:
- Frontend (lines 596-604): 50 tokens, 5 lines minimum
- Backend (lines 700-709): 50 tokens, 5 lines minimum
- Legacy (lines 649-656): 30 tokens, 4 lines minimum (more sensitive)
The summary extraction (lines 607-628, lines 712-733, lines 659-680) reads the jscpd JSON report and extracts the duplication percentage. The error handling creates an empty report structure if parsing fails.
Rating thresholds vary by codebase:
- Frontend/Backend: A if ≤3%, B if ≤5%, C if ≤10%, else D
- Legacy: A if ≤5%, B if ≤10%, C if ≤15%, else D (more lenient)
Bundle Analysis
The frontend-bundle job builds the frontend and analyses the dist/assets directory. The Node.js script (lines 898-932) walks the directory and sums file sizes.
Critical issue: The script includes ALL files in the directory, including source maps:
```javascript
fs.readdirSync(distDir).forEach(file => {
  const filePath = path.join(distDir, file);
  const stats = fs.statSync(filePath);
  const sizeKB = Math.round(stats.size / 1024 * 100) / 100;
  totalSize += stats.size;
  if (file.endsWith('.js')) jsSize += stats.size;
  if (file.endsWith('.css')) cssSize += stats.size;
});
```

This code counts .js and .css files separately but adds ALL files to totalSize, including .map files which aren't served to users. The actual bundle served to users would be significantly smaller.
The rating thresholds are also misleading because they include source maps: A if <500KB, B if <1000KB, C if <2000KB, else D. A 1807KB bundle might rate as 'C' when the actual served bundle (excluding source maps) would be under 500KB and rate as 'A'.
No gzipped sizes: The bundle is reported as uncompressed bytes. Modern web servers serve JavaScript and CSS with gzip or brotli compression, typically reducing size by 70-80%. A 1807KB uncompressed bundle might transfer as only 200-300KB. The metric doesn't reflect this.
Security Scanning
Frontend (lines 761-790): Runs npm audit and extracts vulnerability counts from the JSON output. The rating is: D if any critical, C if any high, B if any moderate, else A. This is sensible.
Backend (lines 817-864): Installs pip-audit via pip install pip-audit (line 815) rather than including it in requirements.txt. This means pip-audit works in CI but not locally, making it harder to test security fixes during development.
The pip-audit parser (lines 828-864) handles two different JSON formats, which suggests the output format changed between versions. The rating is: A if 0 vulnerabilities, B if ≤2, C if ≤5, else D. Unlike npm audit, it doesn't use severity levels because pip-audit doesn't report them in the JSON format being parsed.
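For illustration, a parser that tolerates both shapes might look like this (a sketch; the exact JSON pip-audit emits has varied between versions, so the `dependencies` and `vulns` keys are assumptions to verify against the version in use):

```python
import json

def count_vulnerabilities(raw: str) -> int:
    """Count vulnerabilities across both assumed pip-audit JSON shapes:
    an object with a 'dependencies' key, or a bare list of dependencies."""
    data = json.loads(raw)
    deps = data.get('dependencies', []) if isinstance(data, dict) else data
    return sum(len(d.get('vulns', [])) for d in deps)
```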
Dashboard Generation Code
The dashboard generation happens entirely in coverage.yml via embedded Python and Bash scripts.
Code Quality Issues
Inline Python in YAML heredocs: The entire dashboard generation consists of Python code embedded in YAML heredocs (lines 200-219, lines 222-487, lines 490-559). This makes the code difficult to test, debug, and maintain. The scripts can't be run locally without copying them out of the YAML file, and there's no syntax highlighting or IDE support.
Error handling swallows all exceptions: Nearly every data extraction uses this pattern:
```python
try:
    with open('coverage/frontend/coverage-summary.json') as f:
        d = json.load(f)
    frontend_cov = d['total']['lines']['pct']
except: pass
```

The bare except: pass catches ALL exceptions including syntax errors, file not found, JSON parse errors, and KeyError. If the file format changes or the artifact is missing, the code silently defaults to 0 without logging what went wrong. This makes debugging extremely difficult.
No data validation: The code assumes all numeric values are valid. For example:

```python
avg_cc = round(total_cc / cc_count, 2) if cc_count > 0 else 0
```

This guards against division by zero but doesn't validate that total_cc or cc_count are numbers. If radon outputs unexpected data, the code will crash with unhelpful errors.
Inconsistent rating logic: Rating calculations are duplicated across three places:
- In the individual metric jobs (ci.yml)
- In the badge generation script (coverage.yml lines 244-245)
- Hardcoded in HTML/CSS (coverage.yml lines 387-390)
If you want to change what constitutes an 'A' rating, you need to update it in multiple places. The rating thresholds should be defined once as constants.
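Defining them once might look like this (a sketch; the table structure and `rate` helper are illustrative, the threshold values are taken from the workflows):

```python
# Threshold tables: (limit, rating) pairs checked in order.
# lower_is_better metrics compare with <=, others with >=.
THRESHOLDS = {
    'duplication':     {'limits': [(3, 'A'), (5, 'B'), (10, 'C')], 'lower_is_better': True},
    'complexity':      {'limits': [(5, 'A'), (10, 'B'), (20, 'C')], 'lower_is_better': True},
    'maintainability': {'limits': [(80, 'A'), (60, 'B'), (40, 'C')], 'lower_is_better': False},
}

def rate(metric: str, value: float) -> str:
    """Return the letter rating for a metric; 'D' if no limit matches."""
    spec = THRESHOLDS[metric]
    for limit, rating in spec['limits']:
        if (value <= limit) if spec['lower_is_better'] else (value >= limit):
            return rating
    return 'D'
```

Both CI jobs and the dashboard could then import the same table, so a threshold change happens in exactly one place.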
HTML injection without escaping: The back link injection (lines 200-219) uses string replacement to insert HTML into report files:

```python
back_link = '<div style="position:fixed;top:10px;right:10px;z-index:9999;">...'
content = content.replace('</body>', back_link + '</body>')
```

This works because the HTML is hardcoded and trusted. However, it's fragile because it assumes all HTML files have a </body> tag, and it processes files with errors='ignore' which could corrupt non-UTF-8 files.
Badge generation uses external service: The script downloads badges from shields.io (lines 247-254). If shields.io is down or rate-limits the requests, badge generation fails. The code uses try/except to print errors but doesn't fail the workflow, so you might deploy a dashboard with missing badges.
HTML Generation
The dashboard HTML is generated as a heredoc (lines 562-988). The quality is mixed:
Good aspects:
- Clean, modern CSS using system fonts
- Responsive grid layout using CSS Grid
- Accessible colour contrast (though not tested programmatically)
- Semantic HTML structure
- Chart.js for trend visualisation
- API endpoint for programmatic access
Issues:
- The HTML is 426 lines of code embedded in YAML, making it difficult to edit
- No HTML validation or linting
- No accessibility testing (missing ARIA labels, no keyboard navigation testing)
- Metrics are loaded via JavaScript, so they don't work without JS
- No error states shown to users if the API fetch fails
- Chart library loaded from CDN without Subresource Integrity hash
JavaScript injection risks: The metrics loading code (lines 957-984) reads JSON from the API and directly inserts values into the DOM:

```javascript
document.getElementById('bundle-size').textContent = data.bundle?.size_kb ?? '-';
```

This is safe because it uses textContent rather than innerHTML, but the pattern is fragile. If someone later changes this to innerHTML for formatting, it could introduce XSS vulnerabilities.
Historical Data Management
The history tracking (lines 489-559) fetches the existing history/all.json from the gh-pages branch, appends the current metrics, and keeps the last 90 entries.
Issues:
- The git fetch (lines 130-136) uses `2>/dev/null` to hide errors, making debugging difficult
- Multiple entries on the same date are replaced (line 545), but there's no timestamp, so you lose intra-day history
- No validation that dates are actually dates or in the correct format
- The 90-day limit is hardcoded (line 552) rather than configurable
Data Accuracy Issues
1. Bundle size includes source maps
The bundle analysis (ci.yml lines 908-920) sums all files in dist/assets/:
```javascript
if (fs.existsSync(distDir)) {
  fs.readdirSync(distDir).forEach(file => {
    const filePath = path.join(distDir, file);
    const stats = fs.statSync(filePath);
    totalSize += stats.size;
    // ...
  });
}
```

Vite generates source maps with the .map extension by default. These files are typically as large as or larger than the code itself, but they're never served to users in production. Including them in bundle size metrics is misleading.

For example, if the actual JavaScript is 200KB but the source map is 800KB, the metric reports 1000KB when users only download 200KB.

Fix: Filter out .map files:

```javascript
if (file.endsWith('.map')) return; // Skip source maps
```

2. No gzipped transfer sizes
The bundle metric reports uncompressed bytes. Web servers serve JavaScript and CSS with gzip or brotli compression, which typically reduces size by 70-80% for text files. A 1000KB uncompressed bundle might transfer as only 200KB.
The metric doesn't reflect what users actually download, making it hard to judge real-world performance impact.
Fix: Use a tool like gzip-size or calculate compressed sizes:

```javascript
const zlib = require('zlib');
const gzipSize = zlib.gzipSync(content).length;
```

3. No validation of tool outputs
All metric extraction uses try/except: pass which assumes tools always output valid data. If radon crashes or pytest generates a malformed JSON file, the code silently defaults to 0 without warning.
Fix: Log errors and fail the workflow if data is invalid:
```python
try:
    with open('coverage/backend/coverage.json') as f:
        d = json.load(f)
    backend_cov = round(d['totals']['percent_covered'], 1)
except FileNotFoundError:
    print("ERROR: Backend coverage file not found")
    backend_cov = 0
except (KeyError, json.JSONDecodeError) as e:
    print(f"ERROR: Failed to parse backend coverage: {e}")
    backend_cov = 0
```

4. Frontend complexity uses warning count, not actual complexity
The frontend complexity metric (ci.yml lines 521-567) runs ESLint with --rule '{"complexity": ["warn", 10]}' and counts how many functions trigger warnings. It doesn't calculate average complexity or identify the worst functions.
The backend complexity uses radon to calculate average cyclomatic complexity (a number like 3.5) and maintainability index. The frontend just counts warnings, making the two metrics incomparable.
Why this is misleading: A codebase with 100 functions at complexity 9 gets a perfect score (0 warnings), while a codebase with 1 function at complexity 11 gets 1 warning. The first codebase is much more complex overall but looks better.
5. History file uses date as key, losing intra-day data
The history tracking (coverage.yml line 545) removes existing entries for the same date:
```python
history['entries'] = [e for e in history['entries'] if e.get('date') != date]
history['entries'].append(entry)
```

If you run CI twice in one day, the second run overwrites the first. You can't track changes within a day, which makes debugging difficult if metrics suddenly change.
Fix: Include timestamp or build number in the key.
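A minimal sketch of the timestamped approach, assuming the same history/entries structure (the `MAX_ENTRIES` constant also replaces the hardcoded 90-entry limit):

```python
from datetime import datetime, timezone

MAX_ENTRIES = 90  # configurable instead of hardcoded in the script body

def append_entry(history: dict, entry: dict) -> dict:
    """Append a timestamped entry; keep intra-day runs and cap total length."""
    entry['timestamp'] = datetime.now(timezone.utc).isoformat()
    history.setdefault('entries', []).append(entry)
    history['entries'] = history['entries'][-MAX_ENTRIES:]
    return history
```

Because entries are no longer keyed by date, two runs on the same day both survive, and the timestamp shows which commit produced which numbers.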
Maintainability Concerns
Inline scripts vs separate files
The entire dashboard generation logic lives in YAML heredocs. This creates several problems:
- No local testing: You can't run the scripts locally without copying them out of the YAML
- No syntax checking: IDEs don't provide syntax highlighting or error detection for heredocs
- No version control clarity: Git diffs show changes as YAML changes, making it hard to review code logic
- No code reuse: The same rating logic is duplicated across jobs because scripts can't import shared code
- Difficult debugging: When something breaks, you need to add debugging to the YAML, push, wait for CI to run, then check logs
Better approach: Move Python scripts to .github/scripts/generate-dashboard.py and call them from the workflow. This enables local testing, proper syntax checking, and easier debugging.
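As an illustration, the backend coverage extraction could become a small standalone script (a sketch; the path and JSON keys follow the snippet quoted earlier, the filename and function name are hypothetical):

```python
#!/usr/bin/env python3
"""Extract backend coverage; runnable locally or called from the workflow."""
import json
import sys

def read_backend_coverage(path: str = 'coverage/backend/coverage.json') -> float:
    """Return the backend line-coverage percentage, logging failures loudly."""
    try:
        with open(path) as f:
            data = json.load(f)
        return round(data['totals']['percent_covered'], 1)
    except FileNotFoundError:
        print(f"ERROR: {path} not found", file=sys.stderr)
        return 0.0
    except (KeyError, json.JSONDecodeError) as e:
        print(f"ERROR: failed to parse {path}: {e}", file=sys.stderr)
        return 0.0

if __name__ == '__main__':
    print(read_backend_coverage())
```

Because the logic lives in a function, it can be unit-tested with mock JSON files and run locally against real artifacts before pushing.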
Duplication across jobs
The pattern for generating summary JSON is repeated 10 times across different jobs. Each time it reads a JSON file, extracts metrics, calculates a rating, and writes summary.json. The rating logic is duplicated, meaning changes require updating multiple places.
For example, the duplication rating appears in:
- ci.yml line 623: frontend duplication job
- ci.yml line 675: legacy duplication job
- ci.yml line 728: backend duplication job
If you want to change the 'A' threshold from 3% to 5%, you need to update it three times.
Documentation quality
The workflows have minimal inline documentation. There are no comments explaining:
- Why certain rating thresholds were chosen
- Why backend complexity is informational but frontend complexity must pass
- What the artifact retention policy is and why it's 30 days
- How to add a new metric
- How to debug when a metric is wrong
The README doesn't explain how to run the metrics locally or how the dashboard generation works.
Debugging difficulty
When a metric is wrong, debugging requires:
- Finding which job generated it
- Looking at CI logs to find the Python/Node output
- Guessing which try/except block failed silently
- Adding print statements to the YAML heredoc
- Pushing a commit to trigger CI
- Waiting for CI to run
- Checking logs again
There's no way to run the dashboard generation locally with test data, and no unit tests for the metric extraction logic.
Security Concerns
pip-audit not in requirements.txt
The backend security job installs pip-audit at runtime (ci.yml line 815):
```yaml
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt
    pip install pip-audit
```

The tool works in CI but isn't available for local development because it's not in requirements.txt. Developers can't run security checks locally before pushing, and they might install packages with vulnerabilities without knowing.
Impact: Low severity but annoying. Developers working on dependency updates can't verify they've fixed vulnerabilities locally.
Fix: Add pip-audit to requirements.txt or create a separate requirements-dev.txt for development tools.
Badge generation depends on external service
The dashboard generation downloads badges from shields.io (coverage.yml lines 247-254):
```python
url = f'https://img.shields.io/badge/{label}-{value}-{color}'
urllib.request.urlretrieve(url, f'site/coverage/badges/{name}.svg')
```

If shields.io is down, rate-limits requests, or changes its API, badge generation fails. The error handling prints a message but doesn't fail the workflow, so you might deploy a dashboard with missing badges.
Impact: Low severity. Missing badges don't break functionality, just make the dashboard look incomplete.
Fix: Generate SVG badges locally using a template rather than downloading them.
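A local badge generator needs only a template string (a sketch; the SVG layout is deliberately simpler than shields.io's output, and the width estimates are rough):

```python
BADGE_TEMPLATE = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="{w}" height="20">'
    '<rect width="{lw}" height="20" fill="#555"/>'
    '<rect x="{lw}" width="{vw}" height="20" fill="{color}"/>'
    '<text x="{lx}" y="14" fill="#fff" font-family="Verdana" font-size="11" '
    'text-anchor="middle">{label}</text>'
    '<text x="{vx}" y="14" fill="#fff" font-family="Verdana" font-size="11" '
    'text-anchor="middle">{value}</text>'
    '</svg>'
)

def make_badge(label: str, value: str, color: str = '#4c1') -> str:
    """Render a simple two-panel SVG badge; widths are crude per-character estimates."""
    lw = 10 + 7 * len(label)   # left (label) panel width
    vw = 10 + 7 * len(value)   # right (value) panel width
    return BADGE_TEMPLATE.format(
        w=lw + vw, lw=lw, vw=vw, color=color,
        lx=lw // 2, vx=lw + vw // 2, label=label, value=value,
    )
```

Writing `make_badge('coverage', '87%')` to `site/coverage/badges/coverage.svg` removes the network dependency entirely.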
Chart.js loaded from CDN without SRI
The dashboard loads Chart.js from a CDN (coverage.yml line 870):
```html
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
```

There's no Subresource Integrity (SRI) hash, so if the CDN is compromised or serves unexpected content, the dashboard could load malicious JavaScript.
Impact: Low severity for a metrics dashboard, but still a security best practice to add SRI.
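Computing the integrity value is straightforward with the standard library (a sketch; run it over the exact bytes of the file the CDN serves):

```python
import base64
import hashlib

def sri_hash(content: bytes) -> str:
    """Compute a Subresource Integrity value (sha384) for a script's bytes."""
    digest = hashlib.sha384(content).digest()
    return 'sha384-' + base64.b64encode(digest).decode('ascii')
```

Download chart.umd.min.js once, run `sri_hash` over its bytes, and paste the result into the script tag's integrity attribute.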
Fix: Add the integrity attribute:

```html
<script src="..." integrity="sha384-..." crossorigin="anonymous"></script>
```

No input validation on artifact data
The dashboard generation assumes all data in artifacts is trustworthy. It doesn't validate that percentages are between 0-100, that file counts are non-negative, or that dates are actually dates.
If a malicious actor could control the artifacts (through a compromised CI job or dependency), they could inject arbitrary data into the dashboard. The dashboard displays this data without sanitisation, though the use of textContent instead of innerHTML prevents XSS.
Impact: Very low severity. Artifacts are only created by CI jobs in the same repository, and they'd need write access to compromise those jobs.
Recommendations
Critical
1. Fix bundle size calculation to exclude source maps (ci.yml lines 908-920)
   - Add a check to skip files ending in `.map`
   - This is reporting 1413KB when the actual served bundle is likely under 500KB
   - Current rating is misleading and could hide real bundle bloat
2. Add pip-audit to requirements.txt or requirements-dev.txt (ci.yml line 815)
   - Developers can't run security checks locally
   - Makes it harder to fix vulnerabilities during development
3. Replace bare `except: pass` with proper error handling (throughout coverage.yml)
   - Current code silently fails and defaults to 0
   - Makes debugging extremely difficult when metrics are wrong
   - At minimum, log what went wrong
High Priority
1. Move Python scripts to separate files in `.github/scripts/`
   - Current inline heredocs are difficult to test and debug
   - No syntax highlighting, no local execution, no code reuse
   - Makes maintenance and extension much harder
2. Calculate gzipped bundle sizes, not just raw sizes (ci.yml lines 892-932)
   - Current metric doesn't reflect what users download
   - A 1807KB bundle might compress to 200KB but looks huge in metrics
3. Make ESLint failures actually fail the build (ci.yml lines 69, 77, 83)
   - The `|| true` pattern allows warnings to accumulate
   - Code quality degrades over time with no enforcement
4. Extract rating thresholds into constants
   - Rating logic is duplicated 10+ times across workflows
   - Changes require updates in multiple places
   - Risk of inconsistent ratings
5. Fail dashboard deployment if critical data is missing
   - Current `continue-on-error: true` silently deploys incomplete dashboards
   - At minimum, require coverage and bundle artifacts
Medium Priority
1. Add documentation explaining the CI setup
   - How to add a new metric
   - Why certain thresholds were chosen
   - How to debug when metrics are wrong
   - How to run metrics locally
2. Add SRI hashes to CDN resources (coverage.yml line 870)
   - Security best practice for loading external scripts
   - Low risk but easy to fix
3. Add timestamps to history entries, not just dates (coverage.yml line 514)
   - Current system loses intra-day history
   - Makes debugging difficult if metrics change during the day
4. Generate SVG badges locally instead of downloading from shields.io
   - Removes external dependency
   - Faster and more reliable
5. Validate artifact data before using it
   - Check that percentages are 0-100
   - Check that counts are non-negative
   - Check that required fields exist
   - Log warnings for unexpected values
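The validation recommendation can be sketched as a small pre-deployment gate (the metric field names here are hypothetical):

```python
def validate_metrics(metrics: dict) -> list:
    """Return a list of warnings for values outside expected ranges."""
    warnings = []
    # Percentages must be numeric and within 0-100.
    for key in ('frontend_coverage', 'backend_coverage', 'duplication_pct'):
        value = metrics.get(key)
        if value is None:
            warnings.append(f"{key}: missing")
        elif not isinstance(value, (int, float)) or not 0 <= value <= 100:
            warnings.append(f"{key}: out of range ({value!r})")
    # Counts must be non-negative.
    for key in ('vuln_count', 'file_count'):
        value = metrics.get(key)
        if isinstance(value, (int, float)) and value < 0:
            warnings.append(f"{key}: negative ({value!r})")
    return warnings
```

The dashboard workflow could log these warnings, and fail the deployment if any critical field is missing rather than silently publishing zeros.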
Low Priority
1. Add caching for Node.js in backend duplication jobs
   - Currently installs jscpd fresh each time
   - Would speed up CI runs slightly
2. Unify frontend and backend complexity metrics
   - Frontend uses warning count, backend uses actual cyclomatic complexity
   - Makes the two metrics incomparable
   - Consider using a real complexity analyser for TypeScript
3. Add unit tests for metric calculation logic
   - Extract calculation code to functions
   - Test with mock data
   - Catch regressions when changing rating thresholds
4. Add accessibility testing to the generated HTML
   - Check colour contrast ratios
   - Add ARIA labels
   - Test keyboard navigation
   - Validate HTML with W3C validator
Summary
The metrics tooling collects a comprehensive set of quality indicators across three different codebases using appropriate tools for each. The pipeline is well-architected with parallel execution, artifact-based communication, and separation between collection and reporting. The dashboard itself is functional and provides good visualisations with historical trends.
However, the implementation has significant code quality issues. The entire dashboard generation system is embedded in YAML heredocs totalling over 800 lines of inline Python and JavaScript. Error handling is essentially absent, with bare except: pass blocks throughout that silently fail and default to zero. This makes debugging extremely difficult when metrics are wrong.
Several metrics report misleading data. The bundle size includes source maps that are never served to users, making the actual bundle appear 5-7x larger than it is. The bundle size also reports uncompressed bytes rather than gzipped transfer sizes, which is what actually impacts performance. The frontend complexity metric counts warnings rather than measuring actual complexity, making it incomparable to the backend metric.
The rating logic is duplicated across 10+ locations in the code, making it easy to introduce inconsistencies when changing thresholds. The tool configuration is inconsistent, with pip-audit working in CI but not locally because it's not in requirements.txt.
Grade: C - Functional but needs significant refactoring
The system works and provides value, but the code quality makes it fragile and difficult to maintain. The misleading bundle size metric is particularly problematic because it makes the frontend look much worse than it actually is. The inline scripts and poor error handling create significant technical debt that will make future improvements difficult.
Key Issues:
- Bundle size includes source maps, reporting 1413KB instead of the actual ~200KB served to users
- Over 800 lines of Python/JavaScript embedded in YAML heredocs with no testing or proper error handling
- Bare `except: pass` throughout makes debugging nearly impossible when metrics are wrong
- pip-audit works in CI but not locally, making it hard to fix vulnerabilities during development
- Rating logic duplicated 10+ times across the codebase
Strengths:
- Comprehensive metric collection covering 12 different quality indicators
- Well-designed CI pipeline with parallel execution and artifact-based communication
- Historical tracking with trend graphs provides good visibility into quality changes over time