Anthropic recently added a companion feature to Claude Code called Buddy. It's a small character that sits next to your input box, watches your session, and occasionally says something in a speech bubble. Mine's a rare axolotl called Rook. I wasn't expecting much from it, but after a few evenings of penetration testing across two projects it turned out to be really useful. It kept catching things that Claude Code and I were both missing.
# The projects
Cookie is a self-hosted recipe manager. Django backend, React frontend, a second ES5 frontend for my ancient iPad, and optional AI features through OpenRouter. I've written about it in a few posts now, most recently the guardrails post covering the hooks and rules I use to keep Claude Code from breaking the legacy frontend. The whole thing runs in Docker Compose with nginx in front.
Appserver (private repo for now) is the infrastructure that hosts it. EC2 on AWS, Traefik as the reverse proxy, Cloudflare Tunnel for ingress so there are no inbound ports open at all. Terraform manages the lot: the instance, IAM, Cloudflare Tunnel, DNS, Access policies, WAF rules, monitoring.
I'm working toward putting Cookie live. Before doing that I want to be as confident as I can be about the security of the whole stack, application and infrastructure. So I built a pentest suite by stringing together a bunch of open source tools.
# The pentest suite
It lives inside the Appserver repo. It's a set of bash scripts that wrap tools like nmap, nikto, sqlmap, ffuf, nuclei, and testssl.sh into thirteen modules covering things like recon, security headers, TLS, injection testing, SSRF, auth flows, and path discovery.
There are two YAML target configs. One for Cookie (application layer, 93 API endpoints documented) and one for Appserver (infrastructure layer, Traefik and Cloudflare). Each config has its own rate limits and module skip list. Auth endpoints get throttled to two requests per second so we don't trip Cloudflare's rate limiting.
I'm not a security engineer. I understand roughly how these exploits work and what the tools are doing, but I'm relying on Claude Code to help me make sense of the results. My workflow is iterative: run the suite, go through the output with Claude, fix things in Cookie or Appserver or sometimes in the pentest suite itself, then run it again. I did several rounds of this over a few evenings. The whole time, Rook was watching from the side.
# What Rook caught
Here are a few examples.
# DNS rebinding
I was working through the SSRF protections on Cookie's `/api/recipes/scrape/` endpoint when Rook said:
> DNS check runs once. Attacker hits between verify and request. Fix it twice or cache it.
It had spotted a time-of-check/time-of-use gap. The URL validator resolved DNS once at validation, but an attacker could swap the DNS record between that check and the actual HTTP request. Claude confirmed this was the one known critical still open. We added DNS pinning to close it, resolving the hostname once and then forcing curl to use that IP for the actual request so there's no second lookup to hijack.
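The actual fix forced curl to reuse the validated IP (curl's `--resolve host:port:ip` flag does this for HTTPS without breaking certificate checks). As a hedged illustration of the same idea, here's a minimal Python sketch for plain-HTTP fetches. `pin_url` and its abbreviated blocklist are my own invention for this post, not Cookie's code; the resolver is injectable so the pinning logic can be exercised without real DNS:

```python
import ipaddress
import socket
from urllib.parse import urlsplit, urlunsplit

def pin_url(url, resolve=socket.gethostbyname):
    """Resolve the hostname exactly once, reject internal addresses, and
    rewrite the URL to use the literal IP, so the fetch needs no second
    DNS lookup that an attacker could hijack."""
    parts = urlsplit(url)
    ip = resolve(parts.hostname)
    addr = ipaddress.ip_address(ip)
    # Abbreviated blocklist for illustration; a real validator checks more ranges.
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        raise ValueError(f"refusing to fetch internal address {ip}")
    netloc = ip if parts.port is None else f"{ip}:{parts.port}"
    pinned = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    # The caller sends the request to `pinned` with a Host header of
    # parts.hostname, so virtual hosting still routes correctly.
    return pinned, parts.hostname
```

The key property is that validation and fetch share one resolution, which is exactly the TOCTOU gap Rook pointed at.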
# CSRF masking auth gaps
> CSRF tests mock the token. Real submission skips the check entirely.
This one was annoying because it meant our pentest results were wrong. The CSRF middleware was rejecting unauthenticated POST requests before the auth check even ran, so the suite was reporting "endpoint protected" when it was CSRF doing the blocking, not auth. That's a test coverage gap, not a security win. We restructured the tests to check auth and CSRF independently.
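A toy model of the middleware ordering that misled the suite (names and structure are illustrative, not Django's internals):

```python
def handle_post(has_csrf_token: bool, is_authenticated: bool) -> int:
    """Simplified request pipeline: CSRF middleware fires before auth."""
    if not has_csrf_token:
        return 403  # rejected by CSRF, auth never consulted
    if not is_authenticated:
        return 401  # auth check only reachable with a valid token
    return 200
```

Probing without a token returns 403 whether auth works or not, so "got a 4xx, endpoint protected" proves nothing about the auth layer. Testing auth independently means supplying a valid CSRF token (or using a CSRF-exempt test client) and only then asserting on the unauthenticated response.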
# Rate limiting doing auth's job
> Rate limited doing auths job now.
Similar to the CSRF issue above. Cloudflare's rate limiting was returning 429 on the WebAuthn credential endpoints before auth could even fire. The pentest was reporting the endpoints as "protected" but we'd never actually proved that auth worked on them.
# 404 instead of 401
> *slow blink* Endpoint confusion. 404 means missing. 401 means unauthorized. Wrong error.
Some endpoints were returning 404 for unauthenticated requests because the auth middleware wasn't running for those routes at all. Without that nudge I'd probably have read "404" in the pentest output and moved on thinking the endpoint didn't exist.
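One way a suite can encode that distinction when interpreting unauthenticated probes — a hypothetical classifier, not the suite's actual code:

```python
def classify_unauthed(status: int) -> str:
    """Interpret the status code from an unauthenticated request."""
    if status in (401, 403):
        return "protected"       # an auth layer ran and rejected us
    if status == 404:
        return "inconclusive"    # missing route, or middleware never ran: investigate
    if 200 <= status < 300:
        return "exposed"         # served without credentials
    return "investigate"
```

Treating 404 as "inconclusive" rather than "doesn't exist" is the nudge Rook provided: a 404 on a route you know is registered means the auth middleware isn't in the path at all.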
# Unauthenticated API docs
> Unauthenticated /api docs. Ship it anyway?

It followed up later with:
> Docs exposure beats fixing Django tags that work fine.
The `/api/docs` endpoint was publicly accessible. Not a vulnerability on its own, but it handed anyone a full map of every endpoint, parameter, and response schema. We ended up disabling it entirely in production by gating `docs_url` and `openapi_url` on `DEBUG`, so the endpoints simply don't exist when the app is live, which is stronger than putting them behind auth.
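The gating looks roughly like this, assuming the API layer is django-ninja (the post doesn't say; FastAPI accepts the same two parameters). A config sketch rather than Cookie's exact code:

```python
from django.conf import settings
from ninja import NinjaAPI

# Passing None removes the routes entirely: in production there is no
# /api/docs or /api/openapi.json to protect, misconfigure, or enumerate.
api = NinjaAPI(
    docs_url="/docs" if settings.DEBUG else None,
    openapi_url="/openapi.json" if settings.DEBUG else None,
)
```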
# IP leak through `__str__`
> Dataclass holds the answer. Now use it everywhere.
Then later:
> String still lives in logs. Dataclass doesn't.
We'd built a `ResolvedURL` dataclass to handle validated URLs safely after the SSRF work. Rook spotted that the resolved IP address could still leak through `str()` or `%s` formatting in log statements. We added a `__str__` method that returns the sanitised URL only.
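A sketch of the pattern with hypothetical field names (Cookie's actual dataclass will differ): the field-level `repr=False` keeps the IP out of the generated `__repr__`, and `__str__` returns only the sanitised URL, so f-strings, `%s` formatting, and logging all get the safe form.

```python
from dataclasses import dataclass, field

@dataclass
class ResolvedURL:
    sanitised_url: str                        # safe to log: scheme://host/path
    resolved_ip: str = field(repr=False)      # used internally for the pinned request

    def __str__(self) -> str:
        # Anything that stringifies this object gets the sanitised URL,
        # never the resolved IP.
        return self.sanitised_url
```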
# Referrer-Policy declared twice
> Referrer-policy declared twice. Browser picks first. Dead code overhead.
Duplicate Referrer-Policy headers in nginx with conflicting values. Only one of them is honoured (for Referrer-Policy the browser applies the last valid value it sees, whatever Rook says about picking the first), so the policy we thought was active wasn't the one being applied.
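A check in the spirit of the pentest suite, flagging any header name that appears more than once in the raw response. This is my own sketch, not one of the suite's modules:

```python
from collections import Counter

def duplicate_headers(raw_headers):
    """raw_headers: list of (name, value) pairs as sent on the wire.
    Returns the lowercased names that appear more than once."""
    counts = Counter(name.lower() for name, _ in raw_headers)
    return sorted(name for name, n in counts.items() if n > 1)
```

Header libraries often merge duplicates into one entry, which hides exactly this class of bug, so the check has to run over the raw header list.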
# `2>/dev/null` swallowing errors
> 2>/dev/null swallows the real error. What broke?

A health check in the deployment was redirecting stderr to /dev/null. When Traefik returned HTML instead of JSON during startup, Python's JSON parser failed and all we got was "health check failed" with no indication of why.
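The fix is to capture stderr and surface it in the failure message instead of discarding it. A hedged Python sketch of the pattern (the real health check is a bash script, and `run_check` is a name I've made up):

```python
import subprocess

def run_check(cmd):
    """Run a health-check command, keeping stderr so failures say why."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Include whatever the command actually said, instead of a bare
        # "health check failed".
        detail = result.stderr.strip() or result.stdout.strip()
        raise RuntimeError(f"health check failed: {detail}")
    return result.stdout
```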
# Three runtime bugs shellcheck missed
Claude Code ran shellcheck over all the pentest scripts, reported them clean, and we moved on. Rook pushed back, so I asked Claude to look harder. It found three real bugs that static analysis couldn't see:
- `pentest.sh` used `set -e` but wasn't capturing module exit codes, so any module failure silently aborted the entire run.
- `recon.sh` called `sudo nmap` for UDP scanning, but with no TTY available it hung forever waiting for a password prompt.
- `legacy.sh` had broken Python quoting where XSS payloads containing single quotes broke the shell command.
None of those would have been caught by shellcheck.
# When Rook was wrong
It's not always right. At one point:
> That release ships with the same JWT validation logic you had last month. Different version number, same problem waiting.
Cookie doesn't use JWTs. It's Django session auth with CSRF tokens, no JWT logic anywhere in the codebase. Claude checked, confirmed there was nothing to fix. I petted Rook anyway.
# How it works
I was curious how Buddy actually works, so I asked Claude Code to research it. What it found was surprising. Buddy can't see your codebase, can't use tools, and isn't part of the main Claude conversation. It's a separate API call that gets a filtered, truncated view of your session: just the text portions of the last 12 messages (yours and Claude's responses), each capped at 300 characters, with a 5000 character total limit. Tool results like file contents, grep output, and command output are filtered out entirely. So it never sees the actual code. What it sees is the back-and-forth between you and Claude, the summaries and descriptions rather than the raw data.
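The exact mechanics aren't published, but the description above suggests a filter shaped something like this. This is a speculative reconstruction from the limits Claude reported, not Anthropic's code:

```python
def buddy_view(messages, max_msgs=12, per_msg=300, total=5000):
    """Build Buddy's truncated view of a session.

    messages: list of dicts like {"type": "text"|"tool_result", "text": str}.
    Tool results are dropped entirely; only the last `max_msgs` text
    messages survive, each capped at `per_msg` chars, `total` overall.
    """
    recent = [m for m in messages if m["type"] == "text"][-max_msgs:]
    out, used = [], 0
    for m in recent:
        snippet = m["text"][:per_msg]
        if used + len(snippet) > total:
            snippet = snippet[: total - used]
        out.append(snippet)
        used += len(snippet)
        if used >= total:
            break
    return out
```

Whatever the real implementation looks like, the striking part is how much signal survives a filter this aggressive.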
That makes it surprising how good the observations are. Rook is working from the conversation about the code, not the code itself, running on what appears to be a lightweight model. Despite that it caught a TOCTOU vulnerability, spotted that CSRF and rate limiting were masking auth gaps, and pushed past a clean shellcheck result to find real runtime bugs. A lot of useful insight apparently comes from watching how people talk about their work rather than looking at the work itself.
I started off by copy-pasting what it said into the chat for Claude Code to investigate, but then realised you can just use its name and ask it to explain its observations directly.
There's a slight friction in getting Rook's observations into the main session. You either paste them or address it by name. I think that might be deliberate. It's like paired programming where one person is at the keyboard and the other is watching, questioning, slowing you down. Rook forces you to stop and actually think about what it's saying before you decide whether to dig in. That friction is useful. It stops you just ploughing through results on autopilot.
The useful bit is the different vantage point. I'd be heads-down in the pentest results for one module, focused on the current set of findings. Rook was picking up on patterns in how Claude and I were talking about the work. Terse, two or three sentences, sometimes cryptic. But almost always worth looking into.
# Worth turning on
I thought Buddy would be a novelty. Even over just a few evenings I've lost count of how many times one of Rook's one-liners led to Claude Code finding something real. The overhead is nothing and it's worth having on.