Autonomous Security Agent Built on ADK-Go and Gemini
G.O.L.E.M. — an autonomous security agent that finds business-logic vulnerabilities using typed Go tool schemas, dual-browser perception, and model rotation. Built in one day for Google's Gemini challenge.
Security scanners find CVEs, not broken logic
Automated scanners match known vulnerability patterns. They don't find broken business logic — access control failures, privilege escalation through conversational UI, data hidden in Canvas pixels, multi-step exploit chains. These require understanding what an application is supposed to do, then reasoning about where it doesn't.
LLM agents promise to bridge this gap, but most fail in practice because they can't:
- Perceive real UI state — both DOM text and rendered pixels
- Call tools without hallucinating parameters
- Reason across multi-step attack sequences
- Survive model rate limits mid-run
G.O.L.E.M. is an attempt to solve those engineering problems. Not a demo — infrastructure for making agents reliable.
Built for Google's Gemini Live Agent Challenge. 24 hours, 29 PRs, and 127 commits. ~30,000 lines.
What makes this technically different
- Reliable tool calling → typed Go structs. ADK-Go v0.3.0 makes Tools and OutputSchema mutually exclusive. Instead of prompt-engineering JSON output, each tool returns a Go struct that auto-marshals to JSON Schema. The model cannot fabricate fields that don't exist in the struct.
- Multimodal perception → dual-browser architecture. LightPanda handles DOM scraping (24.6 MB, no bot detection). Playwright handles screenshots (Chromium, pixel capture). The split exists because LightPanda has no graphical rendering engine — it navigates pages without ever seeing them.
- Agent stability → model rotation and tool budgets. `ResilientModel` rotates through 4 slots (2 models x 2 API keys) with exponential backoff. `ToolGuard` caps total tool calls at 50 and retries at 3 per URL. Without these, the agent burns its quota retrying one failed screenshot.
- Observability as a first-class feature. The observer UI streams every LLM call, tool invocation, and reasoning block in real time. Debugging becomes watching the agent think, not reading logs.
These problems aren't specific to security testing. Any agent that calls external tools, survives API failures, and needs debugging faces them.
System overview
Input: Target URL + vulnerability level
Core loop (perceive → reason → execute):
- Scraper renders page via LightPanda → Markdown
- Playwright captures screenshot → PNG
- Gemini receives Markdown + screenshot + attack history → decides next action
- Agent executes typed tool call → result feeds back as context
Safeguards:
- Typed Go structs prevent hallucinated tool calls
- ToolGuard: 50-call budget, 3 retries/URL
- ResilientModel: 4-slot rotation (2 models x 2 keys), exponential backoff
- Dynamic session state injected before each model call
Output: Vulnerability report + full execution trace viewable in the observer
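The 4-slot rotation can be sketched like this, with illustrative names (`rotator`, `slot`) and a stubbed failure standing in for a real Gemini 429 — a sketch of the idea, not the project's `ResilientModel` implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// slot pairs a model name with an API key; 2 models x 2 keys = 4 slots.
type slot struct {
	model, key string
}

// rotator cycles through slots, backing off exponentially on failure.
type rotator struct {
	slots []slot
	cur   int
}

// call tries each slot in turn until one succeeds or all are exhausted.
func (r *rotator) call(do func(slot) error) error {
	backoff := 100 * time.Millisecond
	for i := 0; i < len(r.slots); i++ {
		if err := do(r.slots[r.cur]); err == nil {
			return nil
		}
		r.cur = (r.cur + 1) % len(r.slots) // rotate to the next model/key pair
		time.Sleep(backoff)
		backoff *= 2
	}
	return errors.New("all slots exhausted")
}

func main() {
	r := &rotator{slots: []slot{
		{"gemini-a", "key1"}, {"gemini-a", "key2"},
		{"gemini-b", "key1"}, {"gemini-b", "key2"},
	}}
	attempts := 0
	err := r.call(func(slot) error {
		attempts++
		if attempts < 3 {
			return errors.New("429") // simulate two rate-limited slots
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```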
How the agent works
The 7 tools
| Tool | What it does | Backend |
|---|---|---|
| `browse` | Navigate URL, get Markdown + links (8000-rune cap) | Scraper → LightPanda CDP |
| `screenshot` | Capture full-page PNG | Scraper → Playwright + Chromium |
| `click` | Click CSS selector, return page state (4000 runes) | Scraper → LightPanda CDP |
| `find_hidden` | Scan DOM: 24 patterns, 7 categories (hidden CSS, inputs, comments, debug attrs, route leaks) | Scraper + local regex |
| `payload` | Generate payloads: boundary, logic, auth, XSS, IDOR, hidden | Local generator |
| `api_call` | Authenticated HTTP requests, SSRF-protected | Direct HTTP |
| `echo` | Debug/test loop verification | Local |
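A guard of the kind `api_call` needs might look like the sketch below: resolve the target host, then refuse loopback, private, and link-local addresses. This illustrates the idea, not the project's actual SSRF check:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// ssrfCheck resolves the URL's host and rejects internal addresses,
// so the agent cannot be steered into hitting the VM's own services.
func ssrfCheck(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return err
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("blocked internal address %s", ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(ssrfCheck("http://127.0.0.1/admin")) // blocked internal address 127.0.0.1
}
```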
Every tool returns a Go struct with fixed fields and types. The model can reason about these fields but cannot invent ones that don't exist in the struct.
ToolGuard prevents degenerate loops: 3 failures per URL, 50 calls per run. Added after testing showed the agent spending most turns retrying failed screenshots (PR #51).
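The two budgets can be sketched as a small counter type; names and methods here are assumptions, not the project's actual `ToolGuard` API:

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxCalls      = 50 // hard cap on tool calls per run
	maxURLRetries = 3  // failures tolerated per URL
)

// toolGuard tracks a global call budget and per-URL failure counts.
type toolGuard struct {
	totalCalls int
	urlFails   map[string]int
}

// allow reports whether another call against url may proceed.
func (g *toolGuard) allow(url string) error {
	if g.totalCalls >= maxCalls {
		return errors.New("tool budget exhausted")
	}
	if g.urlFails[url] >= maxURLRetries {
		return errors.New("retry limit reached for " + url)
	}
	g.totalCalls++
	return nil
}

// recordFailure notes a failed call so the per-URL limit can trip.
func (g *toolGuard) recordFailure(url string) {
	g.urlFails[url]++
}

func main() {
	g := &toolGuard{urlFails: map[string]int{}}
	for i := 0; i < 4; i++ {
		if err := g.allow("https://target/app"); err != nil {
			fmt.Println(err) // retry limit reached for https://target/app
			return
		}
		g.recordFailure("https://target/app")
	}
}
```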
What broke
The deploy sprint. Started deployment ~4 hours before the competition deadline. 11 failed runs, 7 infrastructure PRs. Three worth knowing:
- Silent `startup_failure` — Reusable workflow callers need `permissions: id-token: write`. Without it, GitHub kills the run at startup with zero log output. (PR #55)
- Wrong entrypoint — `cmd/golem` (one-shot CLI) instead of `cmd/server` (HTTP). Infinite restart loop that looks like a rate-limit problem. An architecture bug disguised as an API error. (PR #66)
- Overlapping read-only mounts — Docker can't create a mountpoint inside a read-only parent. Cryptic OCI error, instant fix once understood. (PR #67)
Full story: Deploying 4 Services to GCE With 2 Hours Left
The browser that couldn't render pixels. LightPanda has no graphical rendering engine — it navigates the DOM but produces no pixels. Discovered hours before the deadline. Without screenshots, the submission fails the multimodal criterion (40% of the score). Playwright was added in 20 minutes thanks to prior Docker experience from Supacrawler. (PR #69)
Degenerate agent loops. The agent spent most turns retrying failed screenshots. Root cause: the poll rejected HTTP 202 ("still processing"). Fix: exponential backoff on 202, decouple click from screenshot, add ToolGuard. (PR #51)
The vulnerability benchmark
A purpose-built Next.js e-commerce app ("TechShop") with 4 difficulty tiers. Each tests a different agent capability.
Level 0 — DOM inspection (13 vulnerabilities)
Hidden admin link (display:none), HTML comments with bypass URLs, __APP_CONFIG__ leaking API keys, IDOR on /api/users/[id], client-side price calculation with no server validation. Tests basic source reading.
Level 1a — Multi-step UI interaction
- Click the floating support button
- Read chat history → leaked password (`Spring2026_Audit`)
- Navigate to `/internal/recovery`
- Submit password → admin token (`adm_tok_7f3a9b2e4d1c`)
Requires stateful reasoning across 5 sequential actions.
Level 1b — Visual reasoning (Canvas + OCR)
/system-health renders a critical alert inside an HTML Canvas — red text, pixels only, not in the DOM. Only a screenshot + vision model can read it. The alert reveals /api/v1/orders/debug, which returns PII, payment tokens, and database credentials.
This is why multimodal perception matters. Without screenshots, Level 1b is invisible.
Level 2 — Spatial reasoning (z-index)
/admin page: newsletter modal (z-index 9999) covers a "Delete Database" button after 3 seconds. Agent must recognize the obstruction, dismiss the modal, then click through.
Scenario 3 — Exploit chain
Leaked API key → api_call with X-Debug-Key → exfiltrate database credentials and JWT secrets. (PR #69)
How the services connect
Single GCE VM, Docker Compose, Traefik for TLS and routing. Only the observer is externally accessible. Services share data through Docker volumes — scraper writes screenshots, agent writes traces, observer reads both.
The observer
React 19 + Hono. SSE streams agent events to a timeline during active runs: LLM calls with child tool invocations, tool inputs/outputs with screenshot thumbnails, expandable reasoning blocks, and a summary header (model, duration, tokens, call counts).
Two data sources merged: OTel spans capture trace structure but omit content. A companion TraceWriter captures full prompt/response as _events.jsonl. The frontend parses both.
Scenario launcher runs specific vulnerability levels with configurable API keys and model selection.
The scraper
LightPanda — DOM scraping. CDP WebSocket, chromedp.NoModifyURL, goquery extraction, boilerplate removal, Markdown conversion. 24.6 MB image.
Playwright — Screenshots only. Go → exec.Command("node", scriptPath) → Chromium with --no-sandbox --disable-dev-shm-usage. Device presets: desktop (1920x1080), mobile (375x667), tablet (768x1024).
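The Go-to-Node handoff might be wired as below; the script path, flag names, and preset map are illustrative, only the device dimensions come from the text above:

```go
package main

import (
	"fmt"
	"os/exec"
)

// screenshotCmd builds the Node invocation for the Playwright script.
// Flag names are assumptions; the presets mirror the ones listed above.
func screenshotCmd(scriptPath, url, preset string) *exec.Cmd {
	sizes := map[string]string{
		"desktop": "1920x1080",
		"mobile":  "375x667",
		"tablet":  "768x1024",
	}
	return exec.Command("node", scriptPath, "--url", url, "--viewport", sizes[preset])
}

func main() {
	cmd := screenshotCmd("/opt/scraper/shot.js", "https://example.com", "desktop")
	fmt.Println(cmd.Args) // [node /opt/scraper/shot.js --url https://example.com --viewport 1920x1080]
	// cmd.Run() would launch Chromium; skipped here so the sketch has no Node dependency.
}
```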
LightPanda can navigate but can't see. Playwright can see but is heavy. Together: fast DOM + accurate pixels. Detail: Adding vision to a blind scraper.
Key decisions
ADK-Go over Python ADK. Single static binary, tiny images, same language across agent and scraper. No runtime dependency.
Typed tool returns over OutputSchema. ADK-Go v0.3.0 makes them mutually exclusive. Go structs enforce schema at compile time — no parsing retries.
GCE over Cloud Run. Cost (~$25/month) and familiarity from Supacrawler's Docker Swarm deployment. Same pattern, debuggable under pressure.
Companion events alongside OTel. ADK's OTel spans omit content. A parallel TraceWriter captures full prompts/responses as JSONL. Pragmatic fix for an SDK limitation.
4-slot model rotation. 2 models x 2 keys. Overkill locally, essential for competition demos where a Gemini 429 during judging ends the run.
Lessons
Typed return structs are the right hallucination guard for tool-calling agents. Prompt engineering for structured output is fragile. Go structs enforce schema at compile time — the model cannot return fields that don't exist. This eliminated parsing retry loops entirely.
Deploy on day one, not 4 hours before the deadline. Agent code is iterative — a broken prompt still returns something. Deploy is binary. Every failed run costs 5-10 minutes, and at 3 AM those minutes compound. Infrastructure has the longest feedback loop in a hackathon.
Budget your agent's tool calls explicitly. Without ToolGuard, the agent burns its entire Gemini quota retrying one screenshot. Degenerate loops are the default behavior — you have to design against them.
Prior infrastructure work pays compound interest. The Playwright Docker setup from Supacrawler saved this submission. Docker Swarm experience made GCE Compose familiar. Reference solutions reduce deadline risk more than raw speed.
Metrics
| Metric | Value |
|---|---|
| Development time | 1 day (~21 hours) |
| PRs merged | 30 |
| Lines added | ~30,000 |
| Services | 4 |
| Agent tools | 7 |
| Vulnerability tiers | 4 (17+ individual vulnerabilities) |
| VM cost | ~$25/month (GCP free credits) |
| Tests (Go + Vitest) | 173 |
