Go Web Scraper with chromedp, LightPanda, and Asynq

Go scraper that converts JS-heavy pages to LLM-ready Markdown. Migrated from Playwright to chromedp + LightPanda, cutting the Docker image from 1.51 GB to 24.6 MB.

Client: Open Source / Internal Tool
Service: Open Source / Research

Problem

Most web scraping tools are built for data pipelines, not LLM workflows. They output raw HTML — which is token-expensive noise when you need clean text. And they batch: crawl the whole site, then scrape. For a large site, that's minutes before you see the first result.

I wanted a Go scraper that streamed results as pages were discovered, converted HTML to Markdown at the source, and could handle JS-heavy sites without a 1 GB Docker image.

View on GitHub →

Why It's Hard

JS rendering without a browser process. Most pages that need JavaScript either require Chromium (heavy) or fail silently with HTTP-only scraping. The trade-off between "it works" and "it fits in memory" is real, especially on a budget VPS.

Concurrent pipeline coordination. The crawler discovers links and pushes them to a buffered channel (size 256) while a separate goroutine pool (10-20 workers) drains it. Closing the channel too early panics. Not closing it leaks goroutines. Getting the lifecycle right across a multi-worker, multi-goroutine pipeline took several iterations.
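The lifecycle that eventually worked can be sketched with stdlib primitives. This is a minimal illustration of the pattern, not the repo's actual code; names like crawl are hypothetical. The key rule: only the producer closes the channel, and it does so after its last send, so a close can never race with a pending send.

```go
package main

import (
	"fmt"
	"sync"
)

// crawl streams discovered links through a buffered channel and
// waits for a worker pool to drain it before returning.
func crawl(seeds []string, workers int) []string {
	links := make(chan string, 256) // buffered: producer blocks when full (free backpressure)

	// Producer: push all discovered links, then close.
	go func() {
		defer close(links) // safe: only this goroutine ever sends
		for _, s := range seeds {
			links <- s
		}
	}()

	// Consumer pool: drain until the channel is closed.
	var wg sync.WaitGroup
	var mu sync.Mutex
	var results []string
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for link := range links { // range exits cleanly on close
				mu.Lock()
				results = append(results, "scraped:"+link)
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // all workers finished; no leaked goroutines
	return results
}

func main() {
	out := crawl([]string{"/a", "/b", "/c"}, 4)
	fmt.Println(len(out))
}
```

Closing in the producer and ranging in the consumers is what makes both failure modes (panic on send-after-close, leaked blocked workers) structurally impossible.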

Bot detection is a moving target. Basic HTTP requests get blocked by Cloudflare and similar CDNs immediately. The uTLS library (refraction-networking/utls v1.8.0) helps by rotating Chrome, Firefox, Safari, and Edge TLS fingerprints — but the "stealth" mode is currently disabled because LightPanda's obscurity already provides some natural bot avoidance.

Deployment budget. The full stack is Fiber API + Redis + LightPanda + Supabase on a 2 vCPU, 11.25 GB RAM VPS. Four scraper replicas at 0.2 CPU each (1536 MB RAM limit per replica). Any headless browser that's too heavy kills this deployment model.

Architecture

[Architecture diagram]


The scrape path is a 3-tier fallback. First try: LightPanda via chromedp's remote allocator over CDP, with a 15s timeout. LightPanda runs bundled inside the scraper container — no separate browser service. If it times out or fails, fall back to raw HTTP with a modern Chrome TLS fingerprint, then try again with a mobile fingerprint. The final output from any tier is converted to Markdown by JohannesKaufmann/html-to-markdown v1.6.0.

The crawl path is async: POST /crawl enqueues a task into Asynq (which uses Redis under the hood). Ten Asynq workers pick up tasks. Each worker runs a mapper that discovers links and pushes them into a buffered Go channel (256 slots). A separate pool of 10-20 goroutines reads from that channel and calls the same scrape path. You get results streaming out before the crawl completes.

Screenshots are a separate endpoint because LightPanda can't do graphics rendering — that still requires a real Chromium process via chromedp.

Key Engineering Decisions

Asynq over raw Redis commands. The original plan was RPUSH/BLPOP — simple, direct. Asynq (hibiken/asynq v0.25.1) wraps Redis with proper task lifecycle: retry policies, dead-letter queues, task deduplication, and a web UI. For a SaaS where jobs fail in non-obvious ways, that visibility was worth the abstraction.

LightPanda bundled in container. Rather than managing a separate browser service, LightPanda runs inside the scraper container itself. The binary is tiny (~17.7 MiB per instance) and starts fast. This simplifies the deployment topology at the cost of some isolation — a LightPanda crash takes the worker down with it. Acceptable for a research tool; I'd revisit for production SaaS.

uTLS fingerprint rotation. Instead of running Chromium for every request, the HTTP fallback path rotates TLS fingerprints using refraction-networking/utls. This gets past many CDN bot detections without browser overhead. The "stealth" obfuscation code exists in the repo but is disabled — LightPanda's obscurity already helps, and stealth mode added latency without clear benefit.

HTML → Markdown at the scraper layer. Converting output at the source means consumers get clean text immediately. For LLM ingestion, this cuts noise and token count. The JohannesKaufmann/html-to-markdown library handles most edge cases; the few it doesn't are handled with a post-processing cleanup pass.

What Failed

The SaaS. I ran Supacrawler as a SaaS for five months. I shut it down because it wasn't making money and I wasn't having fun anymore. Those are the two reasons. There's no dramatic technical failure story — the infrastructure held up. I just misjudged whether this was a product people would pay for versus a tool developers would self-host.

Playwright's image size. The original version used playwright-go (community wrapper over the official Playwright SDK). The Docker image was 1.51 GB. On a VPS with 11.25 GB RAM and 4 replicas, that's manageable — but it made the build slow and the cold start painful. The migration to chromedp + LightPanda in November 2025 brought the image to 24.6 MB.

LightPanda's graphics limitation. LightPanda works for 90% of JS-heavy scraping, but it has no graphics rendering. Any site that needs canvas, WebGL, or screenshot-based extraction still requires a real Chromium process. The screenshots endpoint (POST /v1/screenshots) still uses chromedp directly. I discovered this post-migration when the screenshot tests started failing; the limitation wasn't clearly stated in LightPanda's documentation.

Concurrency bugs in the mapper. The link discovery mapper uses a buffered channel as a work queue. Early versions had a race: the mapper goroutine would push links into the channel while the consumer goroutines were finishing up, and the channel close would race with a pending send. The fix was a sync.WaitGroup on the consumer side and explicit done signaling on the mapper side — but it took two separate data corruption bugs to get the sequencing right.

What I'd Change

Decide SaaS vs open source earlier. I accumulated billing, auth, and tenant isolation code in the same repo as the scraper core. That made the open-source version confusing to set up and contributions harder. I'd fork the repo at the first sign of SaaS features and keep the core clean.

Add OpenTelemetry from the first commit. Debugging a multi-tier fallback across 10 Asynq workers and 10-20 goroutines without traces is painful. You end up with log grep archaeology. Spans would have caught the LightPanda timeout patterns much earlier.

Validate the SaaS assumption before building. I built the SaaS infrastructure (billing, auth, dashboards) before confirming whether anyone wanted to pay for a scraping API rather than self-host. Five months of infra work to learn that lesson was expensive.

Key Lessons

A 60x image reduction changes the deployment model. The Playwright → LightPanda migration wasn't just a size win. Smaller images mean faster CI builds, faster Kubernetes pulls, cheaper registry storage, and a deployment topology that fits on smaller VPSes. If you're shipping a Go service with a bundled browser, the image size is a first-class constraint.

Go channels make streaming pipelines obvious. The pattern — mapper goroutine pushes to buffered channel, worker pool drains it — is not novel, but it's clean. You get backpressure for free (channel blocks when full), work stealing without a scheduler (workers compete on the same channel), and natural streaming (results come out before the work finishes). I've reused this pattern in every pipeline I've built since.

Shut down what isn't fun. This sounds trivial but isn't. I kept the SaaS running for two extra months past the point I'd lost interest because I felt like I should keep going. The project got better the moment I stopped treating it as a business and started treating it as a research tool.

Docker image: 24.6 MB (was 1.51 GB)
Avg scrape time (LightPanda): 2.98s
JS site success rate: 90%
Replicas on 2 vCPU VPS: 4
