Autonomous Security Agent Built on ADK-Go and Gemini
G.O.L.E.M. — an autonomous security agent that finds business-logic vulnerabilities using typed Go tool schemas, dual-browser perception, and model rotation. Built in one day for Google's Gemini challenge.
Security scanners find CVEs, not broken logic
Automated scanners match known vulnerability patterns. They don't find broken business logic — access control failures, privilege escalation through conversational UI, data hidden in Canvas pixels, multi-step exploit chains. These require understanding what an application is supposed to do, then reasoning about where it doesn't.
LLM agents promise to bridge this gap, but most fail in practice because they can't:
- Perceive real UI state — both DOM text and rendered pixels
- Call tools without hallucinating parameters
- Reason across multi-step attack sequences
- Survive model rate limits mid-run
G.O.L.E.M. is an attempt to solve those engineering problems. Not a demo — infrastructure for making agents reliable.
Built for Google's Gemini Live Agent Challenge. 24 hours, 29 PRs, and 127 commits. ~30,000 lines.
What makes this technically different
- Reliable tool calling → typed Go structs. ADK-Go v0.3.0 makes Tools and OutputSchema mutually exclusive. Instead of prompt-engineering JSON output, each tool returns a Go struct that auto-marshals to JSON Schema. The model cannot fabricate fields that don't exist in the struct.
- Multimodal perception → dual-browser architecture. LightPanda handles DOM scraping (24.6 MB, no bot detection). Playwright handles screenshots (Chromium, pixel capture). The split exists because LightPanda has no graphical rendering engine — it navigates pages without ever seeing them.
- Agent stability → model rotation and tool budgets. `ResilientModel` rotates through 4 slots (2 models x 2 API keys) with exponential backoff. `ToolGuard` caps total tool calls at 50 and retries at 3 per URL. Without these, the agent burns its quota retrying one failed screenshot.
- Observability as a first-class feature. The observer UI streams every LLM call, tool invocation, and reasoning block in real time. Debugging becomes watching the agent think, not reading logs.
These problems aren't specific to security testing. Any agent that calls external tools, survives API failures, and needs debugging faces them.
System overview
Input: Target URL + vulnerability level
Core loop (perceive → reason → execute):
- Scraper renders page via LightPanda → Markdown
- Playwright captures screenshot → PNG
- Gemini receives Markdown + screenshot + attack history → decides next action
- Agent executes typed tool call → result feeds back as context
Safeguards:
- Typed Go structs prevent hallucinated tool calls
- ToolGuard: 50-call budget, 3 retries/URL
- ResilientModel: 4-slot rotation (2 models x 2 keys), exponential backoff
- Dynamic session state injected before each model call
Output: Vulnerability report + full execution trace viewable in the observer
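The 4-slot rotation can be sketched like this, with illustrative names (`rotator`, `slot`) and a stubbed failure standing in for a real Gemini 429 — a sketch of the idea, not the project's `ResilientModel` implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// slot pairs a model name with an API key; 2 models x 2 keys = 4 slots.
type slot struct {
	model, key string
}

// rotator cycles through slots, backing off exponentially on failure.
type rotator struct {
	slots []slot
	cur   int
}

// call tries each slot in turn until one succeeds or all are exhausted.
func (r *rotator) call(do func(slot) error) error {
	backoff := 100 * time.Millisecond
	for i := 0; i < len(r.slots); i++ {
		if err := do(r.slots[r.cur]); err == nil {
			return nil
		}
		r.cur = (r.cur + 1) % len(r.slots) // rotate to the next model/key pair
		time.Sleep(backoff)
		backoff *= 2
	}
	return errors.New("all slots exhausted")
}

func main() {
	r := &rotator{slots: []slot{
		{"gemini-a", "key1"}, {"gemini-a", "key2"},
		{"gemini-b", "key1"}, {"gemini-b", "key2"},
	}}
	attempts := 0
	err := r.call(func(slot) error {
		attempts++
		if attempts < 3 {
			return errors.New("429") // simulate two rate-limited slots
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```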
How the agent works
The 7 tools
| Tool | What it does | Backend |
|---|---|---|
| `browse` | Navigate URL, get Markdown + links (8000-rune cap) | Scraper → LightPanda CDP |
| `screenshot` | Capture full-page PNG | Scraper → Playwright + Chromium |
| `click` | Click CSS selector, return page state (4000 runes) | Scraper → LightPanda CDP |
| `find_hidden` | Scan DOM: 24 patterns, 7 categories (hidden CSS, inputs, comments, debug attrs, route leaks) | Scraper + local regex |
| `payload` | Generate payloads: boundary, logic, auth, XSS, IDOR, hidden | Local generator |
| `api_call` | Authenticated HTTP requests, SSRF-protected | Direct HTTP |
| `echo` | Debug/test loop verification | Local |
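A guard of the kind `api_call` needs might look like the sketch below: resolve the target host, then refuse loopback, private, and link-local addresses. This illustrates the idea, not the project's actual SSRF check:

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// ssrfCheck resolves the URL's host and rejects internal addresses,
// so the agent cannot be steered into hitting the VM's own services.
func ssrfCheck(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return err
	}
	for _, ip := range ips {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() {
			return fmt.Errorf("blocked internal address %s", ip)
		}
	}
	return nil
}

func main() {
	fmt.Println(ssrfCheck("http://127.0.0.1/admin")) // blocked internal address 127.0.0.1
}
```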
Every tool returns a Go struct with fixed fields and types. The model can reason about these fields but cannot invent ones that don't exist in the struct.
ToolGuard prevents degenerate loops: 3 failures per URL, 50 calls per run. Added after testing showed the agent spending most turns retrying failed screenshots (PR #51).
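The two budgets can be sketched as a small counter type; names and methods here are assumptions, not the project's actual `ToolGuard` API:

```go
package main

import (
	"errors"
	"fmt"
)

const (
	maxCalls      = 50 // hard cap on tool calls per run
	maxURLRetries = 3  // failures tolerated per URL
)

// toolGuard tracks a global call budget and per-URL failure counts.
type toolGuard struct {
	totalCalls int
	urlFails   map[string]int
}

// allow reports whether another call against url may proceed.
func (g *toolGuard) allow(url string) error {
	if g.totalCalls >= maxCalls {
		return errors.New("tool budget exhausted")
	}
	if g.urlFails[url] >= maxURLRetries {
		return errors.New("retry limit reached for " + url)
	}
	g.totalCalls++
	return nil
}

// recordFailure notes a failed call so the per-URL limit can trip.
func (g *toolGuard) recordFailure(url string) {
	g.urlFails[url]++
}

func main() {
	g := &toolGuard{urlFails: map[string]int{}}
	for i := 0; i < 4; i++ {
		if err := g.allow("https://target/app"); err != nil {
			fmt.Println(err) // retry limit reached for https://target/app
			return
		}
		g.recordFailure("https://target/app")
	}
}
```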
What broke
The deploy sprint. Started deployment ~4 hours before the competition deadline. 11 failed runs, 7 infrastructure PRs. Three worth knowing:
- Silent `startup_failure` — Reusable workflow callers need `permissions: id-token: write`. Without it, GitHub kills the run at startup with zero log output. (PR #55)
- Wrong entrypoint — `cmd/golem` (one-shot CLI) instead of `cmd/server` (HTTP). Infinite restart loop that looks like a rate-limit problem. An architecture bug disguised as an API error. (PR #66)
- Overlapping read-only mounts — Docker can't create a mountpoint inside a read-only parent. Cryptic OCI error, instant fix once understood. (PR #67)
Full story: Deploying 4 Services to GCE With 2 Hours Left
The browser that couldn't render pixels. LightPanda has no graphical rendering engine — it navigates the DOM but produces no pixels. Discovered hours before the deadline. Without screenshots, the submission fails the multimodal criterion (40% of the score). Playwright was added in 20 minutes thanks to prior Docker experience from Supacrawler. (PR #69)
Degenerate agent loops. The agent spent most turns retrying failed screenshots. Root cause: the poll rejected HTTP 202 ("still processing"). Fix: exponential backoff on 202, decouple click from screenshot, add ToolGuard. (PR #51)
The vulnerability benchmark
A purpose-built Next.js e-commerce app ("TechShop") with 4 difficulty tiers. Each tests a different agent capability.
Level 0 — DOM inspection (13 vulnerabilities)
Hidden admin link (display:none), HTML comments with bypass URLs, __APP_CONFIG__ leaking API keys, IDOR on /api/users/[id], client-side price calculation with no server validation. Tests basic source reading.
Level 1a — Multi-step UI interaction
- Click the floating support button
- Read chat history → leaked password (`Spring2026_Audit`)
- Navigate to `/internal/recovery`
- Submit password → admin token (`adm_tok_7f3a9b2e4d1c`)
Requires stateful reasoning across 5 sequential actions.
Level 1b — Visual reasoning (Canvas + OCR)
/system-health renders a critical alert inside an HTML Canvas — red text, pixels only, not in the DOM. Only a screenshot + vision model can read it. The alert reveals /api/v1/orders/debug, which returns PII, payment tokens, and database credentials.
This is why multimodal perception matters. Without screenshots, Level 1b is invisible.
Level 2 — Spatial reasoning (z-index)
/admin page: newsletter modal (z-index 9999) covers a "Delete Database" button after 3 seconds. Agent must recognize the obstruction, dismiss the modal, then click through.
Scenario 3 — Exploit chain
Leaked API key → api_call with X-Debug-Key → exfiltrate database credentials and JWT secrets. (PR #69)
How the services connect
Single GCE VM, Docker Compose, Traefik for TLS and routing. Only the observer is externally accessible. Services share data through Docker volumes — scraper writes screenshots, agent writes traces, observer reads both.
The observer
React 19 + Hono. SSE streams agent events to a timeline during active runs: LLM calls with child tool invocations, tool inputs/outputs with screenshot thumbnails, expandable reasoning blocks, and a summary header (model, duration, tokens, call counts).
Two data sources merged: OTel spans capture trace structure but omit content. A companion TraceWriter captures full prompt/response as _events.jsonl. The frontend parses both.
Scenario launcher runs specific vulnerability levels with configurable API keys and model selection.
The scraper
LightPanda — DOM scraping. CDP WebSocket, chromedp.NoModifyURL, goquery extraction, boilerplate removal, Markdown conversion. 24.6 MB image.
Playwright — Screenshots only. Go → exec.Command("node", scriptPath) → Chromium with --no-sandbox --disable-dev-shm-usage. Device presets: desktop (1920x1080), mobile (375x667), tablet (768x1024).
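The Go-to-Node handoff might be wired as below; the script path, flag names, and preset map are illustrative, only the device dimensions come from the text above:

```go
package main

import (
	"fmt"
	"os/exec"
)

// screenshotCmd builds the Node invocation for the Playwright script.
// Flag names are assumptions; the presets mirror the ones listed above.
func screenshotCmd(scriptPath, url, preset string) *exec.Cmd {
	sizes := map[string]string{
		"desktop": "1920x1080",
		"mobile":  "375x667",
		"tablet":  "768x1024",
	}
	return exec.Command("node", scriptPath, "--url", url, "--viewport", sizes[preset])
}

func main() {
	cmd := screenshotCmd("/opt/scraper/shot.js", "https://example.com", "desktop")
	fmt.Println(cmd.Args) // [node /opt/scraper/shot.js --url https://example.com --viewport 1920x1080]
	// cmd.Run() would launch Chromium; skipped here so the sketch has no Node dependency.
}
```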
LightPanda can navigate but can't see. Playwright can see but is heavy. Together: fast DOM + accurate pixels. Detail: Adding vision to a blind scraper.
Key decisions
ADK-Go over Python ADK. Single static binary, tiny images, same language across agent and scraper. No runtime dependency.
Typed tool returns over OutputSchema. ADK-Go v0.3.0 makes them mutually exclusive. Go structs enforce schema at compile time — no parsing retries.
GCE over Cloud Run. Cost (~$25/month) and familiarity from Supacrawler's Docker Swarm deployment. Same pattern, debuggable under pressure.
Companion events alongside OTel. ADK's OTel spans omit content. A parallel TraceWriter captures full prompts/responses as JSONL. Pragmatic fix for an SDK limitation.
4-slot model rotation. 2 models x 2 keys. Overkill locally, essential for competition demos where a Gemini 429 during judging ends the run.
Lessons
Typed return structs are the right hallucination guard for tool-calling agents. Prompt engineering for structured output is fragile. Go structs enforce schema at compile time — the model cannot return fields that don't exist. This eliminated parsing retry loops entirely.
Deploy on day one, not 4 hours before the deadline. Agent code is iterative — a broken prompt still returns something. Deploy is binary. Every failed run costs 5-10 minutes, and at 3 AM those minutes compound. Infrastructure has the longest feedback loop in a hackathon.
Budget your agent's tool calls explicitly. Without ToolGuard, the agent burns its entire Gemini quota retrying one screenshot. Degenerate loops are the default behavior — you have to design against them.
Prior infrastructure work pays compound interest. The Playwright Docker setup from Supacrawler saved this submission. Docker Swarm experience made GCE Compose familiar. Reference solutions reduce deadline risk more than raw speed.
Metrics
| Metric | Value |
|---|---|
| Development time | 1 day (~21 hours) |
| PRs merged | 30 |
| Lines added | ~30,000 |
| Services | 4 |
| Agent tools | 7 |
| Vulnerability tiers | 4 (17+ individual vulnerabilities) |
| VM cost | ~$25/month (GCP free credits) |
| Tests (Go + Vitest) | 173 |
