My Virtual Pentest and DevOps Team

A lot of people are building apps with AI right now. Some of it is full-on vibe coding, some of it is more deliberate. Either way, if you're putting something live you have to at least try to secure it.

I built Rockport (an LLM proxy on AWS) with Claude Code. I've got Cloudflare Access and a WAF in front of it, but I wanted to go further than perimeter defences. So I grabbed some open source tools (testssl.sh, ffuf, shellcheck, nmap, nuclei), got Claude Code to help me build a pentest suite around them, and started attacking my own infrastructure. This post is about the tooling that came out of that.

# The suite

I've added a pentest suite to Rockport. It's thirteen bash scripts that I put together with Claude Code. Most of them are just curl. A few use testssl.sh, ffuf, nmap, nuclei, or the AWS CLI.

The suite is driven by a YAML target config that documents the full attack surface: 34 endpoints across six groups, the WAF allowed paths from Terraform, the Cloudflare Tunnel routes, and any known risks or false positives I don't want flagged again.
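For a sense of what that looks like, here is a sketch of such a config; every field name below is made up for illustration, not Rockport's actual schema:

```yaml
# Illustrative shape only; the real target config will differ.
target:
  base_url: https://example.com
endpoint_groups:
  health:
    - path: /health
      expect_status: 200
waf:
  allowed_paths:          # mirrored from Terraform
    - /v1/chat/completions
    - /health
tunnel_routes:
  - hostname: api.example.com
known_risks:
  - id: KR-001
    note: accepted; do not re-flag
```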

The orchestrator creates a throwaway API key with a $0.50 budget, runs every module, revokes the key at the end, and dumps structured JSON and a markdown summary. I used Claude to help me dig into each tool's features, then tested the modules, reviewed the results, and iterated from there.
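Stripped of the key handling, the core of such an orchestrator is just a loop. This is a minimal sketch with made-up names, not Rockport's actual code; the real thing also creates and revokes the throwaway key around the loop:

```shell
#!/usr/bin/env bash
# Run each module, record pass/fail, and emit a JSON results array.
# Module output goes to per-module logs; the summary goes to stdout.
run_suite() {
  local results_file="$1"; shift
  local pass=0 fail=0 first=1 module name status
  printf '[' > "$results_file"
  for module in "$@"; do
    name=$(basename "$module" .sh)
    if bash "$module" > "$name.log" 2>&1; then
      status=pass; pass=$((pass + 1))
    else
      status=fail; fail=$((fail + 1))
    fi
    [ "$first" -eq 1 ] && first=0 || printf ',' >> "$results_file"
    printf '{"module":"%s","status":"%s"}' "$name" "$status" >> "$results_file"
  done
  printf ']\n' >> "$results_file"
  echo "pass=$pass fail=$fail"
}
```

A failing module is recorded and the loop moves on, which is the same "a failure is a finding" philosophy the individual scripts follow.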

# Skills

As I iterated on the suite with Claude Code I kept rewriting the same prompts: view the results, classify findings against the known risks, check if the WAF module covers new rules. After a while I asked Claude to look at our chat history and identify where I was repeating myself. It pointed out a few patterns, so we turned those into Claude Code skills.

# /pentest

Runs the scan. I tell it which module I want (or all thirteen), it checks the required tools are installed, runs the suite, and pulls a summary from results.json with jq. I get a markdown report with pass/fail counts per module instead of digging through log files.
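The jq step is the kind of thing below, shown against a hypothetical results.json shape (the real schema may differ):

```shell
# Hypothetical results.json; field names are assumptions.
cat > results.json <<'EOF'
[
  {"module": "tls", "status": "pass"},
  {"module": "waf", "status": "fail"},
  {"module": "paths", "status": "pass"}
]
EOF

# Count modules per status for the summary:
jq -r 'group_by(.status) | map("\(.[0].status): \(length)") | .[]' results.json
# → fail: 1
#   pass: 2
```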

# /pentest-review

Triages existing results without re-running anything. Finds the latest report, reads only the modules with failures, classifies each finding against the YAML config as NEW, CONFIRMED, REGRESSION, FALSE POSITIVE, or INCONCLUSIVE, and tells me why anything was skipped. If I come back in a month and want to know what the last scan found, I type /pentest-review and get a summary.

# /pentest-align

/pentest-align keeps the suite in sync with the infrastructure. I kept adding WAF rules in Terraform or new endpoints to the sidecar and forgetting to update the tests. This skill reads the .tf files, the sidecar code, and the LiteLLM config, then compares what it finds against what the pentest suite actually tests. When there's a gap it updates the target config and module scripts to match.
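A toy version of that gap check, with made-up file contents and a Terraform-ish attribute format (the real skill reads actual .tf files and the YAML config):

```shell
# Pretend Terraform WAF config and the suite's tested-path list.
cat > waf.tf <<'EOF'
  allowed_paths = ["/v1/chat/completions", "/health", "/v1/models"]
EOF
cat > tested_paths.txt <<'EOF'
/health
/v1/chat/completions
EOF

# Extract the quoted paths from the .tf file and sort both lists.
grep -oE '"/[^"]*"' waf.tf | tr -d '"' | sort > tf_paths.txt
sort tested_paths.txt > cfg_paths.txt

# Paths the WAF allows but the suite never probes:
comm -23 tf_paths.txt cfg_paths.txt    # → /v1/models
```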

# /rockport-ops

The devops side. /rockport-ops handles deployments, debugging, and health checks. If something goes wrong it works through a triage: instance state, service health, logs, and external reachability. It has reference files with the diagnostic commands and common issues so I'm not googling the same AWS CLI incantations every time. When it finds something it routes the fix through my speckit pipeline and runs a 48-assertion smoke test before it's done.

# /factcheck-docs

/factcheck-docs audits the project docs against the codebase. If I change a port number in Terraform and forget to update the README, it catches it.

# Guardrails for the pentest scripts

Claude Code kept putting bugs into the pentest scripts. The same bugs, over and over. So I set up a PreToolUse hook to catch them before they land.

The main one is set -e. I've got a rule in my project constitution (Constitution VI) that says pentest scripts must use explicit error handling. Pentest modules are supposed to keep going when individual tests fail. A failing test is a finding, not a reason to stop. set -e kills the whole module on the first non-zero exit code, so you get a partial scan and miss everything after the first failure.

Claude kept adding it anyway. Every few sessions it would "tidy up" a script and stick set -euo pipefail at the top. The hook catches it and warns Claude before the edit goes through.
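The pattern the constitution asks for instead looks roughly like this (a minimal sketch, not the actual module code):

```shell
# Explicit error handling: a failing check becomes a recorded finding
# and the module keeps going. No set -e anywhere.
failures=0

check() {
  local name="$1"; shift
  if "$@"; then
    echo "[PASS] $name"
  else
    echo "[FAIL] $name"
    failures=$((failures + 1))   # safe even when failures starts at 0
  fi
}

check "first probe" true
check "second probe" false       # a finding, not a fatal error
check "third probe" true         # still runs after the failure

echo "findings: $failures"
# The module exits 0: it completed; the findings live in its output.
```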

It also catches:

  - ((var++)) patterns. Post-increment returns the old value, so when a counter starts at zero the first ((var++)) evaluates to 0 and the command exits non-zero, which kills the script under set -e. Claude loves this for test counters; var=$((var + 1)) is the safe form.
  - exit 1 after tool-missing checks. If ffuf isn't installed the paths module should skip with a [SKIP] tag and exit 0, not fail the scan. Claude defaults to exit 1 for "tool not found", which makes the orchestrator report a failure instead of a skip.
  - local inside loops. Bash scopes local to the function, not the block, so redeclaring it each iteration doesn't do what it looks like, and Claude puts it there often enough to be worth catching.
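The check a hook like this runs can be sketched as below. The patterns mirror the bugs above; the hook wiring (Claude Code passes the tool-call JSON on stdin, and a blocking hook exits 2) is left in comments because the exact schema is worth verifying against the current docs:

```shell
#!/usr/bin/env bash
# Returns non-zero when the proposed script content contains a banned
# pattern, printing the reason to stderr for Claude to read.
check_content() {
  local content="$1"
  if printf '%s\n' "$content" | grep -qE '^[[:space:]]*set -e'; then
    echo "pentest scripts must use explicit error handling, not set -e" >&2
    return 1
  fi
  if printf '%s\n' "$content" | grep -qE '\(\([[:alnum:]_]+\+\+\)\)'; then
    echo "use var=\$((var + 1)); ((var++)) exits non-zero when var is 0" >&2
    return 1
  fi
  return 0
}

# Hook wiring sketch (JSON field names are an assumption):
#   content=$(jq -r '.tool_input.new_string // .tool_input.content // ""')
#   check_content "$content" || exit 2
```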

There's also a PostToolUse hook that runs shellcheck after any pentest script edit. It excludes SC2016 (dollar signs in single-quoted strings) because some pentest scripts like the injection module use curl format strings like '%{http_code}' in single quotes. Without the exclusion, shellcheck flags them and Claude tries to "fix" them by double-quoting, which breaks the scripts.
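SC2016 fires on a literal $ inside single quotes, and in a pentest module that's often exactly what you want to ship unexpanded. A toy illustration (both strings are made up):

```shell
# Two single-quoted strings a pentest module keeps deliberately literal:
fmt='%{http_code}'       # curl's -w format; curl expands it, not the shell
payload='$(sleep 5)'     # injection probe; SC2016 flags the literal $
printf '%s\n' "$fmt" "$payload"

# The PostToolUse hook therefore runs shellcheck with that check excluded:
#   shellcheck --exclude=SC2016 pentest/modules/*.sh
```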

# The loop

Day to day it works like this:

1. /pentest-align to make sure the suite matches the infrastructure.
2. /pentest to scan, then fix what comes up.
3. Commit the fix, deploy it with /rockport-ops, and check the service is still healthy.
4. /pentest-review to check results without re-scanning, and fix any tests that were wrong.
5. /pentest-align again to catch anything that drifted while I was fixing things.

It's not a substitute for a proper security review, but it gives me repeatable coverage that I can run any time I change something.

The repo is public. If you've got suggestions or spot something I've missed, raise an issue.


Updated 7 April 2026 to reference the latest work on the pentest suite and Rockport codebase.