Three Levels of AI-Assisted Pentesting

In April I spent a month pentesting two of my own projects with AI assistance. I started with ad-hoc code review and ended up with two harnesses I can leave running overnight. Along the way I found 35 distinct security issues and fixed all of them. This post is the primer I wish I'd had at the start.

#The setup

The two projects are Cookie, a Django app that runs my recipe collection on passkey-only auth and is now live at cookie.matthewdeaves.com, and appserver, the AWS infrastructure that hosts it. Cookie is public. Appserver is private. There are two prior posts in this series if you want the context.

Cookie is around 200 Python files and 80 API endpoints, with an AI features layer behind daily quotas, a recipe scraper that fetches from a domain allowlist, and a React SPA frontend. It runs on a single EC2 instance behind Cloudflare Access, fronted by a Cloudflare Tunnel, with Postgres, nginx, Traefik, and gunicorn doing the heavy lifting.

I'm not a security engineer, just a solo developer running a real production service. AI development had sped up how fast I was shipping code and, as it turned out, how fast I was introducing security bugs too. I needed something that could keep up. The three levels below are how I found it.

#Level 1: ask Claude

Open a Claude Code session, point it at your codebase, and ask it to look for security issues. You can paste a single endpoint and ask what's wrong with it, ask for a review of a whole file or feature area, or just start a session and ask Claude to look through the project for anything concerning.

It's a reasonable first step. Claude can reason carefully about code it hasn't seen before and will often catch things you've been looking at too long to notice.

But there are real limits. There's no tooling, no exploration, no running system. Claude can only work with what you give it and reason about what you ask. You're entirely reliant on knowing what to ask about in the first place. The IDOR bugs (where one user can access another user's data by guessing an ID) on the recipe write endpoints didn't come up because I didn't think to ask about them. Level 1 gets you the obvious wins quickly if you're starting from scratch. Past that, the ceiling is low.

#Level 2: a curated suite, wrapped in a harness

I have a pentest suite for Rockport that I wrote about in the 5 April post. For Cookie I built a more thorough version: 14 bash modules, a YAML target config with 86 endpoints tracked, and four Claude Code skills (/pentest, /pentest-align, /pentest-review, plus /threat-ops, which stays private with appserver). Most modules combine curl with wrappers around nuclei, nikto, ffuf, testssl.sh, nmap, and sqlmap.

Running the suite is one command:

./pentest/pentest.sh run-all

It takes 15–30 minutes. Findings come out as JSON. The /pentest-review skill triages them: NEW / CONFIRMED / REGRESSION / FALSE POSITIVE / INCONCLUSIVE.

That's a single run. The interesting bit is what happens when you wrap it in a loop.

#The Level 2 harness

There's a shell script next to the suite. It runs the scan, hands the findings to Claude for triage, applies fixes, validates them, and expands the test coverage so the same class of bug gets caught next time. Then it runs again. It exits when two consecutive iterations come back with zero new findings.

The driver is a couple of hundred lines, mostly logging, convergence checks, and result parsing. The interesting line is this one:

timeout 7200 claude -p "$(cat pentest/HARNESS_PROMPT.md)" \
    --dangerously-skip-permissions \
    --allowedTools 'Bash,Read,Write,Edit,Glob,Grep,Agent' \
    --add-dir "$HOME/cookie" \
    --model opus

Three things stand out:

--dangerously-skip-permissions — no approval prompts. Claude acts on everything without pausing for me.
--add-dir ~/cookie — Claude has read/write access to the Cookie codebase, not just to appserver where the harness lives.
timeout 7200 — two-hour hard kill. If Claude hasn't finished an iteration by then, the harness retries.

⚠ --dangerously-skip-permissions disables all tool permission checks. Only run it in an isolated environment. Anthropic's safer alternative is --permission-mode auto, which routes tool calls through a classifier before they execute, though it requires a Max, Team, or Enterprise plan and is currently a research preview.

State persists in loop-state.json between iterations. Each iteration is a fresh Claude session with no memory of the prior one beyond what's written to disk. This keeps context from degrading and means each iteration produces a coherent record.
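The convergence rule and the on-disk state are simple enough to sketch. This is a minimal Python illustration rather than the actual bash driver: run_iteration is a hypothetical stand-in for "spawn a fresh Claude session and parse its results", and the keys inside the state file are assumptions. Only the file name loop-state.json and the two-clean-iterations rule come from the harness described here.

```python
import json
from pathlib import Path
from typing import Callable


def run_until_converged(
    run_iteration: Callable[[int], int],
    state_path: Path,
    clean_needed: int = 2,
    max_iterations: int = 20,
) -> dict:
    """Loop until `clean_needed` consecutive iterations report zero new findings.

    `run_iteration` receives the iteration number and returns the count of
    new findings. All state lives in `state_path`, so nothing depends on
    memory carried over from an earlier session.
    """
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        # Hypothetical state layout; the real file may track more.
        state = {"iteration": 0, "clean_streak": 0, "converged": False}

    while not state["converged"] and state["iteration"] < max_iterations:
        state["iteration"] += 1
        new_findings = run_iteration(state["iteration"])
        # Any new finding resets the streak; a clean pass extends it.
        state["clean_streak"] = state["clean_streak"] + 1 if new_findings == 0 else 0
        state["converged"] = state["clean_streak"] >= clean_needed
        # Persist after every iteration so a timeout or crash can resume.
        state_path.write_text(json.dumps(state, indent=2))
    return state
```

Writing the state after every iteration, rather than once at the end, is what lets a killed session pick up where the previous one stopped.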

What this looks like in practice: on 21 April the harness ran five iterations. The first one wrote its state at 18:35 UTC, the last at 22:23 UTC. In between it found two MEDIUM rate-limit bypass bugs, opened two pull requests on Cookie, merged them itself, and converged. I was watching a movie.

#When a fix introduces a new bug

Both bugs show why the loop matters more than a single-pass scan. Iteration 3 found that get_client_ip() used the leftmost X-Forwarded-For value, which clients can inject through Cloudflare. The harness spawned a subagent, the subagent created PR #102, the PR switched the function to use CF-Connecting-IP, the subagent merged it, the harness re-validated. PASS.

Iteration 4 ran the same probe against the deployed fix. CF-Connecting-IP can also be injected by clients. Cloudflare does not strip the header on its way through. Five out of five bypasses confirmed. The harness opened PR #103 with the correct fix (rightmost XFF, which Cloudflare does strip), merged it, and re-validated. PASS.

Two iterations, two security bugs, two PRs, one of them caused by the previous fix. A single-pass scan misses this entirely. The fix-introduces-bug pattern only appears if you re-validate, and most scanning tools only run once.
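For readers who haven't hit this class of bug, here is a rough sketch of the "rightmost XFF" idea. It is not Cookie's get_client_ip(): the function name, the trusted_hops parameter, and the assumption that exactly one trusted proxy appends the final entry are all illustrative. The right hop count depends on your own proxy chain.

```python
from ipaddress import ip_address
from typing import Optional


def client_ip_from_xff(xff_header: str, trusted_hops: int = 1) -> Optional[str]:
    """Return the entry written by the outermost trusted proxy.

    Entries further left are whatever the client chose to send, so they
    must never feed a rate-limit key.
    """
    entries = [part.strip() for part in xff_header.split(",") if part.strip()]
    if len(entries) < trusted_hops:
        return None  # header is shorter than the proxy chain: do not guess
    candidate = entries[-trusted_hops]
    try:
        return str(ip_address(candidate))
    except ValueError:
        return None  # malformed entry: caller should fall back to the socket peer
```

With a header of "1.2.3.4, 203.0.113.7", where the first entry is spoofed, this returns 203.0.113.7. Reading the leftmost value would have returned the attacker's choice.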

#When the suite converges

Convergence means the deterministic tests I have are clean for two iterations running. It doesn't mean there's nothing left to find. The suite can only test what it was told to test. By 21 April it had nothing new to say.

#Level 3: HexStrike

Once the deterministic suite is clean, that's the right time for HexStrike — for finding what the suite wasn't told to test.

HexStrike is an MCP server (Model Context Protocol, the open standard Claude Code uses to connect to external tool providers) that gives Claude Code autonomous access to 150+ pentesting tools. It runs in a Docker container. Claude drives tools like nuclei, ffuf, sqlmap, and others covering SQL injection, XSS, HTTP smuggling, JWT attacks, and parameter discovery. The container ships with the full toolkit; what Claude focuses on is controlled by the brief and the harness prompt.

Using it feels less like running a tool and more like having a security researcher look over the codebase for a few hours.

#Kickoff prompts

Each round starts with a kickoff prompt. By round 17 these had grown to 21KB of structured Markdown. Not conversational prompts but bespoke penetration test plans, with target version, rate-limit budgets, auth setup steps, section-by-section probe plans, prioritisation, cleanup instructions, and a report template.

You can also embed coordination steps for moments when you need to be physically present. The round 16 kickoff has this:

WIFI SWITCH REQUIRED FOR THIS TEST

Before running A2, pause and tell the user:

"Please switch to a different WiFi network now. Your current IP has a CF Access bypass baked into the Zero Trust policy that bypasses ALL path-level rules regardless of WAF allowlist state. Once you've switched networks and your IP has changed, run curl -sS -4 https://api.ipify.org to confirm the new IP, then tell me and I will continue."

I needed to test whether the /.well-known/* Cloudflare Access bypass was correctly scoped to security.txt only. From my home IP I couldn't tell, because that IP had its own bypass rule in the Zero Trust policy that applied regardless of WAF state. So the prompt encoded the workflow: pause, ask the user to switch WiFi, wait for confirmation, run the probes from the new IP, ask the user to switch back, re-add the IP to the WAF allowlist. Claude followed it exactly. The test confirmed PASS.

Encoding "stop here, a physical action from me is needed" into an otherwise autonomous session turns out to be genuinely useful. Most of the time Claude just runs. At the point where I actually needed to do something, it stopped and waited.

#Findings only HexStrike finds

The deterministic suite found the configuration and hardening problems: security headers that weren't being set, response bodies leaking server internals, HTTP endpoints accepting methods they shouldn't, and gaps in how scanner traffic was being blocked. HexStrike found the things that needed actual adversarial behaviour: concurrent requests, injected headers, and a virtual passkey to drive the authentication flow. Three examples:

Quota race condition on the AI tips endpoint. Cookie limits how many times per day each user can request AI-generated tips. The probe sent five requests simultaneously to an endpoint that had a cached response. All five succeeded and the daily counter landed at -1 rather than decrementing by one. Three bugs contributed: the cache initialisation was not atomic under concurrent writes, the quota increment did a read-then-update without holding a lock, and the release path decremented unconditionally even when nothing was consumed. Fixing it required three separate changes.

Passkey login accepted requests from an attacker-controlled origin. Cookie's WebAuthn implementation verifies that login requests come from the correct origin (the real website, not an attacker-controlled one). Four configuration decisions that each made sense individually stacked into a bypass. Django was set to USE_X_FORWARDED_HOST=True for port handling behind Traefik. localhost was in ALLOWED_HOSTS for local development. Cloudflare passes client-supplied X-Forwarded-Host headers through to the origin unchanged. And Traefik trusts the 127.0.0.0/8 range for forwarded headers. An attacker could inject X-Forwarded-Host: localhost with a WebAuthn assertion claiming the origin was https://localhost, and the origin check would pass. Self-login only. This couldn't be used to take over someone else's account, but the origin binding is a core part of what makes passkeys secure, and it was broken. Fix: stop deriving the expected origin from the request and read it from an environment variable instead.

Starting a registration flow could extend the expiry window of an in-progress login. Cookie's passkey implementation has three flows that each issue a time-limited challenge: registering a new passkey, logging in with one, and adding a second passkey. All three were writing their challenge timestamp into the same session key. Starting a registration flow after requesting a login challenge would overwrite the timestamp, resetting the 300-second expiry clock on the in-flight login. Fix: a separate session key for each flow. Found in round 23, fixed in v1.64.0.
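The shape of that last fix is easy to show. This is a sketch under assumptions, not Cookie's code: a plain dict stands in for the Django session, and the flow names and key format are made up. The only detail taken from the finding is the 300-second window and the idea of one key per flow.

```python
import time
from typing import Optional

CHALLENGE_TTL_SECONDS = 300  # the expiry window described above
FLOWS = ("register", "login", "add_passkey")  # hypothetical flow names


def _key(flow: str) -> str:
    if flow not in FLOWS:
        raise ValueError(f"unknown flow: {flow}")
    return f"webauthn_challenge_{flow}"


def issue_challenge(session: dict, flow: str, challenge: str,
                    now: Optional[float] = None) -> None:
    """Store the challenge and its timestamp under a key unique to the flow."""
    issued_at = time.time() if now is None else now
    session[_key(flow)] = {"challenge": challenge, "issued_at": issued_at}


def challenge_is_live(session: dict, flow: str,
                      now: Optional[float] = None) -> bool:
    """True only if this flow issued a challenge less than the TTL ago."""
    entry = session.get(_key(flow))
    if entry is None:
        return False
    current = time.time() if now is None else now
    return current - entry["issued_at"] < CHALLENGE_TTL_SECONDS
```

Because each flow reads and writes only its own key, starting a registration can no longer touch the timestamp an in-flight login depends on.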

None of these would have shown up in a code review of any single file. Each one only existed because two separate, independently reasonable things interacted in a way neither was designed to handle. Static analysis sees nothing wrong with either part on its own. The deterministic suite can't reach them either. They need concurrent requests, header injection through a specific proxy chain, or a stateful sequence across the passkey ceremony.
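To make the quota race concrete, here is a small in-memory sketch of the two properties the fix needs: check-and-increment as one atomic step, and a release that only gives back what was actually taken. The class and method names are hypothetical, and a threading lock stands in for whatever the database layer provides. In a Django app the equivalent would more likely be a conditional update using F() expressions or select_for_update().

```python
import threading


class DailyQuota:
    """In-memory stand-in for a per-user daily usage counter."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0
        self._lock = threading.Lock()

    def reserve(self) -> bool:
        """Check and increment under one lock so concurrent calls cannot interleave."""
        with self._lock:
            if self.used >= self.limit:
                return False
            self.used += 1
            return True

    def release(self, reserved: bool) -> None:
        """Give a unit back only if this request actually took one."""
        if not reserved:
            return
        with self._lock:
            if self.used > 0:  # never drop below zero
                self.used -= 1
```

A cached response that consumed nothing would call release(False), which leaves the counter untouched instead of pushing it negative.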

#The HexStrike harness

HexStrike rounds started as manual sessions driven by a kickoff prompt file. By late April I was running four or five rounds in a single long session. Each round generated a self-contained Python probe script, ran it inside the container, wrote a report, fixed any findings, and moved on.

I've formalised that pattern into a harness:

./pentest/hexstrike/harness.sh [max-rounds] [nosleep]

It runs a fresh Claude session per round (3-hour cap), reads the existing reports to determine the next round number, spawns the round, and loops. Convergence is the same rule as Level 2: two consecutive rounds with zero new findings. The main difference from the Level 2 harness is the sleep between rounds. The WebAuthn canary uses about ten of the twenty per-hour login-verify allowance, so the harness waits until at least 60 minutes have passed since the previous round started. A long round gets zero sleep. A short round waits long enough for the budget to reset. Pass nosleep as the second arg to disable the sleep entirely, useful when rounds are running long enough that the budget has already reset.
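The sleep rule reduces to one subtraction. A minimal sketch, with a hypothetical function name and timestamps passed in as plain seconds so it is easy to test; the real harness does this in bash:

```python
import math


def seconds_to_sleep(
    round_started_at: float,
    now: float,
    window_seconds: int = 3600,
    nosleep: bool = False,
) -> int:
    """Seconds to wait so that `window_seconds` have passed since the round started."""
    if nosleep:
        return 0
    elapsed = now - round_started_at
    # A round longer than the window needs no sleep at all.
    return max(0, math.ceil(window_seconds - elapsed))
```

A 25-minute round sleeps for 35 minutes, and a 90-minute round starts the next one immediately.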

After each round the harness parses the run log for [PASS]/[FAIL] counts and shows them inline alongside the state summary, so you get a quick read on probe coverage without opening the log file. Any fixes that fail tests are tracked in failed_fixes in loop-state.json and surfaced both mid-run and in the final summary so they aren't buried.
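Counting the markers is a one-regex job. A sketch in Python, assuming only that each probe result carries a literal [PASS] or [FAIL] somewhere in the log; the function name and the rest of the log layout are assumptions:

```python
import re
from collections import Counter

MARKER = re.compile(r"\[(PASS|FAIL)\]")


def count_results(log_text: str) -> dict:
    """Tally [PASS] and [FAIL] markers anywhere in a run log."""
    counts = Counter(MARKER.findall(log_text))
    return {"pass": counts["PASS"], "fail": counts["FAIL"]}
```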

The companion repo has both harness implementations.

#What was found

I closed 35 distinct security findings across both projects between 1 April and 28 April.

Severity	Count
Critical	0
High	3
Medium	9
Low	20
Info / hardening	3

The two layers found very different things. The regression suite picked up configuration and hardening problems: missing security headers, nginx settings leaking server internals, endpoints accepting HTTP methods they shouldn't, and gaps in scanner detection. HexStrike found the access control and authentication bugs: endpoints missing auth checks, rate limits that could be bypassed by spoofing a header, and the passkey-ceremony issues that needed a virtual authenticator to reach.

Broken down by category, loosely following the OWASP Top 10 (the standard classification framework for web vulnerabilities):

Category	Count
Broken Access Control — users reaching data or actions they shouldn't	9
Security Misconfiguration — wrong settings, missing headers, exposed internals	13
Authentication / Session — passkey ceremony bugs, session handling	6
Cryptographic Failures — TLS and key handling issues	2
Insecure Design — structural issues that can't be patched away	1
Infrastructure / IaC — cloud config and network hardening	4

The three most serious bugs all came from the most recent rounds, not because they were harder to find but because finding them required a virtual WebAuthn authenticator. This is software that plays the role of a hardware passkey, generating cryptographically signed authentication assertions programmatically. On 24 April I wrote one — a small wrapper around the open-source soft_webauthn library that adds the user-verification flag the upstream library doesn't set — and dropped it into my HexStrike container. Before that, six rounds of HexStrike (R13–R18) couldn't meaningfully probe the passkey ceremony because there was no way to actually drive it. You can't test what happens after a successful passkey login if you can't perform one. Once it was in place, the coverage went from "checked the response codes" to actually manipulating the authentication data, and three new vulnerabilities appeared within a couple of rounds.
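For context on what "the user-verification flag" means at the byte level: in the WebAuthn authenticator data, a 32-byte RP ID hash is followed by a single flags byte, where bit 0 is user present (UP) and bit 2 is user verified (UV). The sketch below only illustrates flipping that bit. It is not the wrapper described above and does not use soft_webauthn's API, and the function name is made up.

```python
FLAG_UP = 0x01      # user present (bit 0)
FLAG_UV = 0x04      # user verified (bit 2)
FLAGS_OFFSET = 32   # the flags byte follows the 32-byte rpIdHash
MIN_LENGTH = 37     # rpIdHash (32) + flags (1) + signCount (4)


def with_user_verified(authenticator_data: bytes) -> bytes:
    """Return a copy of authenticatorData with the UV bit set."""
    if len(authenticator_data) < MIN_LENGTH:
        raise ValueError("authenticatorData must be at least 37 bytes")
    patched = bytearray(authenticator_data)
    patched[FLAGS_OFFSET] |= FLAG_UV
    return bytes(patched)
```

The assertion signature covers the authenticator data, so a virtual authenticator has to set the flag before signing. Patching the bytes afterwards would just produce an invalid signature.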

Broad scanning doesn't help much if you can't reach the surface that matters. The serious bugs were in the passkey ceremony. I couldn't find them until I had a way to actually drive it.

#What it cost

I'm on a Claude Max subscription, so my real out-of-pocket for April was the flat monthly fee. But Claude Code keeps a transcript of every session, and the open-source ccusage tool can parse those transcripts and apply Anthropic's per-token pricing. That gives a useful "what would this have cost on the API" figure for anyone considering this approach without a subscription.

I went through April's transcripts and pulled out the sessions whose first prompt clearly came from a pentest activity — Level 2 harness iterations, HexStrike round kickoffs, and the pentest skills. Here's what those sessions would have cost on the API in dollars:

Activity	Sessions	Total	Median per run
Level 2 harness iteration	35	$138	$2.12
HexStrike round (manual kickoff)	5	$30	$5.30
`/pentest` skill	1	$17	$17.30
`/pentest-align` skill	5	$50	$6.11
`/pentest-review` skill	5	$53	$10.75
Total	51	$288	—

A few things those numbers tell you:

Level 2 iterations are cheap because the bash scan suite does most of the heavy lifting. Claude only gets called for triage, fixes, and re-validation. A clean iteration that finds nothing is a couple of dollars; a busy one with three fixes can hit the high twenties.
HexStrike rounds run for up to three hours each. The prompts include the full briefs and prior reports, and Claude is doing the actual probing rather than just reading scan output. They cost more per session than a Level 2 iteration but I run far fewer of them — five manual rounds across the month versus thirty-five Level 2 iterations.
/pentest-align reads across the whole codebase and the target config to find places they've drifted apart. Worth running maybe once a week, not on every deploy.
/pentest-review just classifies findings from a recent scan, but the scan output is large, so the input tokens add up.

Which model does what. The cost differences also track which model actually ran each activity, and it isn't what I'd have guessed before checking the transcripts:

Activity	Main thread	Sub-agents
Level 2 harness iteration	Opus 4.6 (via `--model opus`)	Haiku 4.5
HexStrike round (manual)	Sonnet 4.6 (CLI default)	Haiku 4.5
`/pentest`	Opus 4.6	Mix of Opus 4.6 and Haiku 4.5
`/pentest-align`	Opus 4.6	Haiku 4.5
`/pentest-review`	Opus 4.6	Haiku 4.5

The HexStrike rounds in April were manual claude invocations, which default to Sonnet 4.6 rather than Opus. That's why they came in cheaper per session than the skills, even though they're doing more work — Sonnet is roughly five times cheaper per token than Opus. The Level 2 harness passes --model opus explicitly so its main thread runs Opus, but the sub-agents the harness spawns for fix work mostly fall back to Haiku, which keeps the per-iteration cost down. If I ran the HexStrike harness with --model opus instead of manual Sonnet rounds, those round numbers would be roughly three to five times higher.

The table only counts sessions I could classify with confidence. There's also pentest-driven fix work in the Cookie repo — rate-limit tuning, hardening, dependabot triage — that I couldn't cleanly separate from regular development. With that included, the realistic total for April was somewhere between $400 and $500 in API-equivalent terms.

Compute didn't add anything. HexStrike runs in Docker on my Ubuntu desktop. The Cookie EC2 instance was already up for the live site, so the pentest didn't add to that bill.

If you were running this on the API directly with no subscription, the rough rules:

A Level 2 regression iteration on every deploy: a few dollars per deploy.
HexStrike rounds weekly: thirty to fifty dollars a month if you're doing four to eight rounds.
The pentest skills: invoke as needed, five to thirty dollars each.
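Those rules of thumb can be reproduced from the cost table. The sketch below uses the mean cost per session (total divided by sessions) rather than the median, so that busy runs are included. The function names are hypothetical, and the HexStrike figure reflects manual Sonnet rounds, so treat the output as a rough floor rather than a forecast.

```python
# (sessions, total USD) taken from the April cost table
APRIL = {
    "level2_iteration": (35, 138.0),
    "hexstrike_round": (5, 30.0),
}


def mean_cost(activity: str) -> float:
    """Average API-equivalent cost of one session of the given activity."""
    sessions, total = APRIL[activity]
    return total / sessions


def monthly_estimate(deploys: int, hexstrike_rounds: int) -> float:
    """Rough monthly API spend for a given number of deploys and rounds."""
    total = (deploys * mean_cost("level2_iteration")
             + hexstrike_rounds * mean_cost("hexstrike_round"))
    return round(total, 2)
```

Four to eight HexStrike rounds come out at $24 to $48 on these averages, and ten deploys add roughly $39.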

For my use, the Max subscription covers all of this comfortably. If I were on pay-as-you-go API pricing, the HexStrike rounds would be the part I'd budget for most carefully.

#Setting this up for your own repo

If I were starting from scratch today, here's the order I'd actually recommend. This is the inverse of the narrative above. Go straight to HexStrike and let findings drive what you regress against, not the other way round. Don't try to do all of it at once. And don't save it for the end. Security issues in established code get harder to fix as surface area grows.

Start running probes as soon as you have a working prototype, and bake each finding back into your Claude setup as a rule, hook, or skill so the same class of mistake can't sneak back in. Cookie's Django security rules and template safety hook are direct outputs of this process. I have posts on project-specific Claude Code customisations and setting up guardrails if that pattern is new to you.

#Step 1: get HexStrike running locally

The upstream HexStrike repo is just the Python MCP server that orchestrates the tools. It expects roughly 150 pentest binaries (nmap, nuclei, sqlmap, ffuf, hydra, gobuster, and so on) to already be on the host, which in practice means running it on Kali Linux or another distro pre-loaded with the toolkit. I work on Ubuntu, so I built pentest-kit to package the whole thing up. Its Dockerfile bakes a kalilinux/kali-rolling base image with the web-app toolkit pre-installed, then HexStrike runs inside that container.

Don't trim the toolkit per-project. The way you control what HexStrike focuses on is through the brief and the harness prompt, not by removing tools from the image. An attacker wouldn't limit their toolkit, and Claude follows the scope you give it.

git clone https://github.com/matthewdeaves/pentest-kit
cd pentest-kit
./setup.sh          # installs jq, shellcheck, gh; checks docker + claude
cd hexstrike
./launch.sh --rebuild   # first build: 15–20 minutes
./launch.sh             # subsequent starts

claude mcp add hexstrike docker -- exec -i hexstrike-ai \
    python3 /opt/hexstrike-ai/hexstrike_mcp.py \
    --server http://localhost:8888

#Step 2: write the briefs

Two living markdown documents that describe the attack surface: briefs/yourapp.md for the application, briefs/yourinfra.md for the infrastructure. Include the target URL, the auth model, the rate limits, the public endpoint surface, and any known-fixed findings. These get updated between rounds. They're what carries context from one session to the next.

You can ask Claude to draft the initial brief from your codebase: point it at your API routes, auth middleware, and settings, and ask it to document the attack surface, auth model, rate limits, and anything worth noting about the infrastructure. It won't get everything right first time but it gets you 80% of the way there quickly.

#Step 3: run the first round manually

Don't try to write a generic harness prompt yet. Write a kickoff prompt specific to round one: reference the briefs, list the auth setup steps, list the sections you want probed, specify the report format. Run it.

Claude handles the whole round — writing the probe scripts, running them inside the container, producing the report, and proposing fixes. Your job is to read the output, judge whether the findings look real, and update the brief. Do this a few times until you have a feel for what a good round looks like.

#Step 4: add the HexStrike harness

Once the round structure feels repeatable, generalise your kickoff prompt into a HARNESS_PROMPT.md that drives one round without you in the room. Wrap it in a harness.sh that loops fresh Claude sessions until two consecutive rounds come back clean. The companion repo has a working implementation you can adapt.

The HARNESS_PROMPT.md in the companion repo includes a Phase 5 where Claude writes a regression test for every finding it fixes, then runs the test suite to confirm the fix holds before merging. You get tests automatically as the harness runs — you don't write them by hand.

#Step 5: add a curated regression suite as findings accumulate

Every HexStrike finding should become a regression test. Don't try to build the curated suite first. Let the findings drive what you regress against.

If you're using the harness, you're already getting tests — Phase 5 of the harness prompt writes one for each finding as it goes. But you can also work through historical reports explicitly: point Claude at the exploratory report files and ask it to write a bash test for each finding, add it to the suite, and confirm it passes against the current codebase. Something like "read hexstrike/exploratory-report-r3.md, write a regression test for each finding in the suite, and run the suite to confirm they all pass." Claude can do that in a single session.
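As an illustration of what one such regression check boils down to, here is a sketch for a "missing security header" finding. The suite described here is bash; this is Python only to keep the examples in one language. The header list is an illustrative assumption, not Cookie's actual policy, and the function name is made up.

```python
REQUIRED_HEADERS = (
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
)  # illustrative list only


def missing_security_headers(response_headers: dict) -> list:
    """Return the required header names absent from a response (case-insensitive)."""
    present = {name.lower() for name in response_headers}
    return [name for name in REQUIRED_HEADERS if name.lower() not in present]
```

A test then fetches a page, passes the response headers in, and fails if the returned list is non-empty.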

Once you have a suite worth running, the /pentest-align skill compares it against your codebase and flags gaps — endpoints that exist but aren't tested, auth boundaries that changed, rate limits that drifted. It's useful both for initial setup and for keeping things in sync as the app evolves.
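The skill itself is an LLM reading code and config, but the simplest part of what it looks for, endpoints that exist in one place and not the other, is a set difference. A deterministic sketch with hypothetical names and made-up paths:

```python
def coverage_gaps(code_routes: set, tracked_endpoints: set) -> dict:
    """Compare routes found in the codebase with endpoints listed in the target config."""
    return {
        # exists in the app, but the suite never probes it
        "untested": sorted(code_routes - tracked_endpoints),
        # still in the config, but no longer exists in the app
        "stale": sorted(tracked_endpoints - code_routes),
    }
```

This catches added and removed routes. It says nothing about changed auth boundaries or drifted rate limits, which is where the skill's reading of the code earns its cost.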

Eventually you have enough deterministic tests to wrap them in a regression harness that runs on every deploy. That layer is cheap (15–30 min of bash, no LLM tokens), so it goes on the deploy hook. The HexStrike harness runs less often, weekly or after a major feature addition.

#Step 6: monitor CVEs on dependencies

Pentesting finds your bugs. CVE monitoring catches everyone else's. Trivy in CI plus Dependabot covers most of it. This is plumbing, not AI work.

Running costs are low relative to what a real external engagement would cost. A Level 2 harness iteration is capped at two hours of Claude Opus time, and the bare regression suite uses no LLM tokens at all. A HexStrike round is longer with larger prompts and more tool calls. Both are cheap enough to run regularly. Level 2 goes on the deploy hook, HexStrike weekly or on-demand.

#What I keep private and why

The "many eyeballs make bugs shallow" argument assumes eyeballs you actually have. Cookie is open source and nobody's watching it closely. All 35 findings came from running adversarial tooling against my own infrastructure. For a small solo project, open source doesn't give you the security benefit the argument assumes. AI tooling has made the attacker side of that equation considerably cheaper, while the defence work still falls entirely on the maintainer. Running adversarial tooling on a schedule is what this post is about. Dependency CVE monitoring catches what I can't pen-test for: known vulnerabilities in packages I've pulled in. Neither is a substitute for a proper professional audit.

I keep appserver private. That's the AWS infrastructure-as-code, the Cloudflare Access setup, the WAF rules, the Traefik config, the deploy scripts, and the entire pentest playbook including all the kickoff prompts and reports. Releasing any of that tells an attacker exactly what I have already tested for and what I have not. The HexStrike kickoff prompts and the exploratory reports are detailed enough that they constitute a guided tour of the live attack surface. I'm not comfortable releasing it while the service is live. I'm not a security expert and I can't be confident I've found everything. Publishing a detailed map of what I've tested and what I haven't is not a trade-off I'm willing to make.

I kept Cookie public, but that decision sits less comfortably now than it did a month ago, particularly since the app is deployed on the open internet. The most serious bugs required interacting with the live site to confirm — source access doesn't directly reveal them. But source access lowers the cost of attack: an attacker reading the code understands the auth model, the quota system, the session handling, the WebAuthn configuration without having to reverse-engineer any of it from observed behaviour. That trade-off only works if the project has people actively watching it — multiple developers, regular maintenance, eyes on the code. For a solo project with no contributors and no active maintenance, I'd keep the source private.

Open source remains the right default for projects with real community involvement. For solo projects with no contributors, the case is weaker than the standard arguments suggest, and the AI-assisted attacker tooling is what changed it.

#Wrap-up

The templates from this post live at github.com/matthewdeaves/pentest-kit. The pentest-kit project page has the full inventory of what's in it.

Cookie is live at cookie.matthewdeaves.com with source at github.com/matthewdeaves/cookie. If you spot something I've missed, raise an issue.

Updated 29 April 2026 to add a section on what the pentesting actually cost, broken down by activity, with totals from ccusage against the JSONL transcripts Claude Code keeps locally. Also added a note on which model (Opus, Sonnet, Haiku) each activity actually ran on, since that matters more than I'd realised.