Scrape Dojo is my new open-source project and the official successor to docudigger. Instead of just scraping Amazon invoices, you can now automate any website with declarative JSON workflows. And yes, AI has massively helped me build it.
Hey! If you've been following me for a while, you might know docudigger, my tool that automatically downloaded Amazon invoices as PDFs. It was cool, but also pretty limited: it could only handle Amazon. At some point, I wanted more. Much more. And that's how Scrape Dojo was born. 🥷
From docudigger to Scrape Dojo: The Evolution
docudigger was my first serious open-source project. A Node.js tool that used Puppeteer to log into Amazon, navigate through the order history, and download all invoices as PDFs. Super handy for tax season and document management.
But here's the thing: everything was hardcoded for Amazon. Every time Amazon changed its site, the tool broke. And if someone wanted to scrape a different website? Tough luck: rewrite everything from scratch.
The idea behind Scrape Dojo was therefore clear from the start: What if you could describe scraping declaratively? Like Infrastructure-as-Code, but for web scraping. And that's exactly what Scrape Dojo has become.
What Can Scrape Dojo Do? 🥷
Scrape Dojo is a self-hosted web scraping and browser automation platform. The core idea: instead of writing Puppeteer code, you describe your scrapes in JSON/JSONC files. The tool handles the rest.
Declarative JSON Workflows
Workflow files are the heart of Scrape Dojo. You define steps and actions (navigate, click, type, extract, download), all in a simple JSON structure:

    {
      "scrapes": [{
        "id": "my-first-scrape",
        "metadata": {
          "description": "Read a page title",
          "triggers": [{ "type": "cron", "value": "0 8 * * *" }]
        },
        "steps": [{
          "name": "Main",
          "actions": [
            { "action": "navigate", "params": { "url": "https://example.com" } },
            { "name": "title", "action": "extract", "params": { "selector": "h1" } },
            { "action": "logger", "params": { "message": "Title: {{previousData.title}}" } }
          ]
        }]
      }]
    }

Not a single line of Puppeteer code written, yet a fully functional scraping pipeline with a cron trigger. That's the magic. ✨
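To make "the tool handles the rest" a bit more tangible, here's a toy interpreter for the step above. It is only a sketch of the general idea, not Scrape Dojo's actual engine: the types are inferred from the example, the handlers are stand-ins that never touch a browser, and names like `runStep` and `handlers` are made up for illustration.

```typescript
// Hypothetical types inferred from the example workflow; not Scrape Dojo's real schema.
interface Action { name?: string; action: string; params: Record<string, string> }
interface Step { name: string; actions: Action[] }

type Handler = (
  params: Record<string, string>,
  previousData: Record<string, unknown>,
) => unknown;

// Stand-in handlers: a real engine would drive a headless browser here.
const logs: string[] = [];
const handlers: Record<string, Handler> = {
  navigate: (params) => `visited ${params.url ?? ""}`,
  extract: () => "Example Domain", // pretend this is the text of the matched <h1>
  logger: (params, previousData) => {
    // Resolves only the single placeholder used in the example.
    const message = params.message ?? "";
    logs.push(message.replace("{{previousData.title}}", String(previousData.title)));
  },
};

// Run actions in order; results of named actions become visible to later ones.
function runStep(step: Step): Record<string, unknown> {
  const previousData: Record<string, unknown> = {};
  for (const action of step.actions) {
    const handler = handlers[action.action];
    if (!handler) throw new Error(`Unknown action: ${action.action}`);
    const result = handler(action.params, previousData);
    if (action.name) previousData[action.name] = result;
  }
  return previousData;
}

const mainStep: Step = {
  name: "Main",
  actions: [
    { action: "navigate", params: { url: "https://example.com" } },
    { name: "title", action: "extract", params: { selector: "h1" } },
    { action: "logger", params: { message: "Title: {{previousData.title}}" } },
  ],
};

const data = runStep(mainStep);
console.log(data, logs);
```

The point of the sketch: the workflow file is just data, and a small dispatcher maps each `action` string to a function. That is why a layout change on the target site only needs a changed selector in the JSON, not new code.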
25+ Built-in Actions
Scrape Dojo ships with over 25 actions that you can combine in your workflows:
- navigate, click, type: Browser interactions
- extract, extractAll: Pull data from the DOM
- screenshot, download: Save files and screenshots
- loop, condition: Control flow right in the workflow definition
- logger, webhook: Monitoring and notifications
On top of that, you get Handlebars templates and JSONata expressions for dynamic data and powerful transformations.
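If you haven't used Handlebars before, the `{{previousData.title}}` placeholder from the example boils down to a path lookup. The sketch below is a dependency-free stand-in that handles only dotted paths. It is not Handlebars itself (which also supports helpers, blocks, and escaping), and `renderTemplate` and `lookup` are names I made up for this illustration.

```typescript
type Context = Record<string, unknown>;

// Walk a dotted path such as "previousData.title" through nested objects.
function lookup(context: Context, path: string): unknown {
  let current: unknown = context;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Replace every {{ dotted.path }} with its value; missing values render as "".
function renderTemplate(template: string, context: Context): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match, path: string) => {
    const value = lookup(context, path);
    return value === undefined || value === null ? "" : String(value);
  });
}

const message = renderTemplate("Title: {{previousData.title}}", {
  previousData: { title: "Example Domain" },
});
console.log(message); // Title: Example Domain
```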
Real-Time Monitoring
Scrape Dojo comes with a modern Angular UI featuring SSE-based live tracking. You see in real time what your scrape is doing: which step is running, what data is being extracted, and whether errors occur. No more guessing from log files.
Security First
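For context, SSE (Server-Sent Events) is a plain-text protocol: the server keeps an HTTP response open and writes small frames that the browser picks up as they arrive. Here's a minimal sketch of what one frame looks like on the wire. The event name `progress` and the payload fields are assumptions for illustration; Scrape Dojo's real event schema may look different.

```typescript
// Hypothetical payload shape for a progress update.
interface ScrapeProgress {
  scrapeId: string;
  step: string;
  action: string;
  status: "running" | "done" | "error";
}

// Serialize one event in the text/event-stream format:
// an "event:" line, one "data:" line per payload line, then a blank line.
function formatSseEvent(eventName: string, payload: unknown): string {
  const dataLines = JSON.stringify(payload)
    .split("\n")
    .map((line) => `data: ${line}`);
  return [`event: ${eventName}`, ...dataLines, "", ""].join("\n");
}

const update: ScrapeProgress = {
  scrapeId: "my-first-scrape",
  step: "Main",
  action: "extract",
  status: "running",
};

const frame = formatSseEvent("progress", update);
console.log(frame);
```

On the browser side, an `EventSource` listening for `progress` events would receive the JSON string as `event.data`.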
Credentials are stored with AES-256-CBC encryption at rest. Plus, there's optional authentication with JWT, OIDC/SSO, and even MFA/TOTP. For a self-hosted tool that logs into websites, security isn't a nice-to-have, it's a must.
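If you're wondering what AES-256-CBC encryption at rest looks like in Node.js, here's a minimal sketch using the built-in `node:crypto` module. It shows the general technique only: the key derivation, the `iv:ciphertext` storage format, and the function names are my assumptions for this example, not how Scrape Dojo actually stores credentials.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Derive a 32-byte key (256 bits) from a passphrase. Salt handling is simplified here.
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return scryptSync(passphrase, salt, 32);
}

// Encrypt a string and return "ivHex:ciphertextHex" (an assumed storage format).
function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(16); // CBC needs a fresh 16-byte IV per message
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${ciphertext.toString("hex")}`;
}

// Split the stored value back into IV and ciphertext, then decrypt.
function decrypt(stored: string, key: Buffer): string {
  const separator = stored.indexOf(":");
  const iv = Buffer.from(stored.slice(0, separator), "hex");
  const data = Buffer.from(stored.slice(separator + 1), "hex");
  const decipher = createDecipheriv("aes-256-cbc", key, iv);
  return Buffer.concat([decipher.update(data), decipher.final()]).toString("utf8");
}

const key = deriveKey("correct horse battery staple", randomBytes(16));
const stored = encrypt("my-shop-password", key);
const roundTrip = decrypt(stored, key);
console.log(roundTrip === "my-shop-password");
```

Because the IV is random, encrypting the same password twice yields different stored values, which is what you want for data at rest.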
The Tech Stack
Under the hood, Scrape Dojo uses the best tools the JavaScript ecosystem has to offer:
- NestJS 11: Backend framework with modular architecture
- Angular 21: Frontend with modern signal-based state management
- Puppeteer: Headless Chrome for browser automation
- Nx 22: Monorepo management with intelligent caching
- TypeScript: Fully typed from API to UI
AI as a Development Partner 🤖
Now here's the part many developers don't like to talk about: I actively use AI in the development of Scrape Dojo. And I'm completely transparent about it: the repo even carries an "AI-Aided Development (AIAD)" banner in the README.
This doesn't mean I let the AI generate code and just hit "Merge". It means I use tools like Claude Code as a real development partner:
- Architecture discussions: I discuss design decisions with the AI before writing code
- Code generation: Boilerplate, tests, and repetitive patterns are created with AI assistance
- Code reviews: The AI reviews my code and catches issues I'd miss
- Documentation: The entire docs site at scrape-dojo.com was built with AI support
- Refactoring: Complex restructurings are planned and executed with AI
The result? As a solo developer, I'm significantly more productive. Features that would have taken me days now take hours. And the code quality hasn't dropped; quite the opposite. Thanks to constant AI reviews, the code is more consistent and better tested than docudigger ever was.
Quick Start with Docker 🐳
The beauty of Scrape Dojo: you can get it running in under a minute. One Docker Compose file and you're good to go:

    services:
      scrape-dojo:
        image: ghcr.io/disane87/scrape-dojo:latest
        ports:
          - '8080:80'
        environment:
          - SCRAPE_DOJO_ENCRYPTION_KEY=your_key_here
          - DB_TYPE=sqlite
        volumes:
          - ./data:/home/pptruser/app/data
          - ./config:/home/pptruser/app/config
        restart: unless-stopped

Then just open http://localhost:8080 (UI and API run on the same port). Drop your scrape definitions as JSON files into the config folder and they auto-appear in the UI (hot reload!).
docudigger as a Scrape Dojo Workflow 💡
Best part: everything docudigger could do, you can now replicate as a Scrape Dojo workflow. Log into Amazon, navigate through order history, download invoices: all declarative, all configurable, all extensible. And when Amazon changes its layout again, you just tweak the JSON file instead of debugging code.
Conclusion 🎯
Scrape Dojo is the logical evolution of docudigger for me. From a single Amazon scraper to a universal scraping platform. From hardcoded Puppeteer scripts to declarative workflows. From solo development to AI-aided development.
The project is open source (MIT license), fully self-hosted, and free for everyone. If you're interested, give it a try. I'd love to hear your feedback, see your issues, and of course collect stars on GitHub! ⭐