Scrape Dojo: From Amazon Scraper to Universal Platform 🥷

Scrape Dojo is my new open-source project and the official successor to docudigger. Instead of just scraping Amazon invoices, you can now automate any website with declarative JSON workflows. And yes, AI has massively helped me build it.

Hey! If you've been following me for a while, you might know docudigger, my tool that automatically downloaded Amazon invoices as PDFs. It was cool, but also pretty limited: it could only handle Amazon. At some point, I wanted more. Much more. And that's how Scrape Dojo was born. 🥷

From docudigger to Scrape Dojo: The Evolution 🔄

docudigger was my first serious open-source project. A Node.js tool that used Puppeteer to log into Amazon, navigate through the order history, and download all invoices as PDFs. Super handy for tax season and document management.

But here's the thing: everything was hardcoded for Amazon. Every website change from Amazon broke the tool. And if someone wanted to scrape a different website? Tough luck, rewrite everything from scratch.

The idea behind Scrape Dojo was therefore clear from the start: What if you could describe scraping declaratively? Like Infrastructure-as-Code, but for web scraping. And that's exactly what Scrape Dojo has become.

What Can Scrape Dojo Do? 🥷

Scrape Dojo is a self-hosted web scraping and browser automation platform. The core idea: you no longer write Puppeteer code; you describe your scrapes in JSON/JSONC files, and the tool handles the rest.

Declarative JSON Workflows

The heart of Scrape Dojo is its workflow files. You define steps and actions (navigate, click, type, extract, download), all in a simple JSON structure:

{
  "scrapes": [{
    "id": "my-first-scrape",
    "metadata": {
      "description": "Read a page title",
      "triggers": [{ "type": "cron", "value": "0 8 * * *" }]
    },
    "steps": [{
      "name": "Main",
      "actions": [
        { "action": "navigate", "params": { "url": "https://example.com" } },
        { "name": "title", "action": "extract", "params": { "selector": "h1" } },
        { "action": "logger", "params": { "message": "Title: {{previousData.title}}" } }
      ]
    }]
  }]
}

Not a single line of Puppeteer code, yet you get a fully functional scraping pipeline whose cron trigger (0 8 * * *) fires every day at 08:00. That's the magic. ✨
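To make the data flow in that example a bit more tangible, here is a minimal TypeScript sketch of how a named action result could end up in a later {{previousData.*}} placeholder. This is not Scrape Dojo's engine: runActions, render, and the stub handlers are hypothetical names, the browser work is faked, and the real tool uses Handlebars rather than this tiny regex.

```typescript
// Hypothetical mini-runner, for illustration only (not Scrape Dojo's engine).
type Action = {
  name?: string;
  action: string;
  params: Record<string, string>;
};

type Handler = (params: Record<string, string>) => string | undefined;

// Replace {{previousData.key}} placeholders; unknown keys are left untouched.
function render(template: string, previousData: Map<string, string>): string {
  return template.replace(
    /\{\{\s*previousData\.(\w+)\s*\}\}/g,
    (match: string, key: string) => previousData.get(key) ?? match
  );
}

// Run actions in order and store each named result under its name.
function runActions(
  actions: Action[],
  handlers: Record<string, Handler>
): Map<string, string> {
  const previousData = new Map<string, string>();
  for (const { name, action, params } of actions) {
    const handler = handlers[action];
    if (!handler) throw new Error(`Unknown action: ${action}`);
    // Resolve templates in every param before the handler sees it.
    const resolved: Record<string, string> = {};
    for (const [key, value] of Object.entries(params)) {
      resolved[key] = render(value, previousData);
    }
    const result = handler(resolved);
    if (name !== undefined && result !== undefined) {
      previousData.set(name, result);
    }
  }
  return previousData;
}

// Stub handlers standing in for the real browser-backed actions.
const logLines: string[] = [];
const stubHandlers: Record<string, Handler> = {
  navigate: () => undefined,
  extract: () => "Example Domain", // pretend the h1 contained this text
  logger: ({ message }) => {
    logLines.push(message ?? "");
    return undefined;
  },
};

const collected = runActions(
  [
    { action: "navigate", params: { url: "https://example.com" } },
    { name: "title", action: "extract", params: { selector: "h1" } },
    { action: "logger", params: { message: "Title: {{previousData.title}}" } },
  ],
  stubHandlers
);
console.log(logLines[0]); // prints "Title: Example Domain"
```

The point of the sketch: the "name" on the extract action is what makes its result addressable later, which is why the logger can reference previousData.title without any glue code.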

25+ Built-in Actions

Scrape Dojo ships with over 25 actions that you can combine in your workflows:

navigate, click, type: Browser interactions
extract, extractAll: Pull data from the DOM
screenshot, download: Save files and screenshots
loop, condition: Control flow right in the workflow definition
logger, webhook: Monitoring and notifications

On top of that, you get Handlebars templates and JSONata expressions for dynamic data and powerful transformations.

Real-Time Monitoring

Scrape Dojo comes with a modern Angular UI featuring live tracking based on SSE (Server-Sent Events). You see in real time what your scrape is doing: which step is running, what data is being extracted, and whether errors occur. No more guessing from log files.
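If you haven't worked with SSE before: the server keeps one HTTP response open and writes small text frames into it. Here is a rough TypeScript sketch of parsing such frames. The wire format (event:, data:, blank line as separator) is standard SSE, but the event name "step" and the JSON payload are made up for illustration and are not Scrape Dojo's actual event schema.

```typescript
type SseEvent = { event: string; data: string };

// Parse a complete chunk of SSE text into events.
// A trailing event without its closing blank line is dropped in this sketch.
function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  let event = "message"; // default event type in the SSE format
  let dataLines: string[] = [];

  for (const line of chunk.split(/\r\n|\n|\r/)) {
    if (line === "") {
      // A blank line dispatches whatever has been collected so far.
      if (dataLines.length > 0) {
        events.push({ event, data: dataLines.join("\n") });
      }
      event = "message";
      dataLines = [];
      continue;
    }
    if (line.startsWith(":")) continue; // comment line, often a keep-alive

    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // strip one leading space

    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  return events;
}

// Hypothetical stream content: one named event, a keep-alive, one default event.
const sampleStream =
  'event: step\ndata: {"step":"Main","status":"running"}\n\n' +
  ": keep-alive\ndata: done\n\n";

const parsedEvents = parseSse(sampleStream);
console.log(parsedEvents.length); // prints 2
```

In a browser you would normally let the built-in EventSource do this parsing for you; the sketch only shows what travels over the wire.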

Security First 🔐

Credentials are stored with AES-256-CBC encryption at rest. Plus, there's optional authentication with JWT, OIDC/SSO, and even MFA/TOTP. For a self-hosted tool that logs into websites, security isn't a nice-to-have, it's a must.
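For the curious, this is roughly what AES-256-CBC encryption at rest looks like with Node's built-in crypto module. It's a sketch under my own assumptions, not Scrape Dojo's actual code: the scrypt key derivation, the "iv:ciphertext" hex storage format, and the function names are all hypothetical.

```typescript
import { Buffer } from "node:buffer";
import {
  createCipheriv,
  createDecipheriv,
  randomBytes,
  scryptSync,
} from "node:crypto";

// Derive a 32-byte key from a passphrase (assumption: the real tool may
// treat its encryption key differently).
function deriveKey(passphrase: string, salt: Buffer): Buffer {
  return scryptSync(passphrase, salt, 32);
}

// Encrypt with a fresh random IV and return "ivHex:ciphertextHex".
function encryptSecret(plaintext: string, key: Buffer): string {
  const iv = randomBytes(16); // AES block size; never reuse an IV with CBC
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  return `${iv.toString("hex")}:${ciphertext.toString("hex")}`;
}

// Reverse of encryptSecret: split off the IV, then decrypt.
function decryptSecret(payload: string, key: Buffer): string {
  const [ivHex, dataHex] = payload.split(":");
  if (!ivHex || !dataHex) throw new Error("Malformed payload");
  const decipher = createDecipheriv(
    "aes-256-cbc",
    key,
    Buffer.from(ivHex, "hex")
  );
  return Buffer.concat([
    decipher.update(Buffer.from(dataHex, "hex")),
    decipher.final(),
  ]).toString("utf8");
}

// Usage with a throwaway passphrase and salt.
const demoKey = deriveKey("your_key_here", randomBytes(16));
const stored = encryptSecret("amazon-password", demoKey);
console.log(decryptSecret(stored, demoKey) === "amazon-password"); // prints true
```

One thing worth knowing: CBC on its own gives you confidentiality but no tamper detection, so the encryption key itself has to stay out of the data folder.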

The Tech Stack

Under the hood, Scrape Dojo uses the best tools the JavaScript ecosystem has to offer:

NestJS 11: Backend framework with modular architecture
Angular 21: Frontend with modern signal-based state management
Puppeteer: Headless Chrome for browser automation
Nx 22: Monorepo management with intelligent caching
TypeScript: Fully typed from API to UI

AI as a Development Partner 🤖

Now here's the part many developers don't like to talk about: I actively use AI in the development of Scrape Dojo. And I'm completely transparent about it: the repo even carries an "AI-Aided Development (AIAD)" banner in the README.

This doesn't mean I let the AI generate code and just hit "Merge". It means I use tools like Claude Code as a real development partner:

Architecture discussions: I discuss design decisions with the AI before writing code
Code generation: Boilerplate, tests, and repetitive patterns are created with AI assistance
Code reviews: The AI reviews my code and catches issues I'd miss
Documentation: The entire docs site at scrape-dojo.com was built with AI support
Refactoring: Complex restructurings are planned and executed with AI

The result? As a solo developer, I'm significantly more productive. Features that would have taken me days now take hours. And the code quality hasn't dropped, quite the opposite. Thanks to constant AI reviews, the code is more consistent and better tested than docudigger ever was.

Quick Start with Docker 🐳

The beauty of Scrape Dojo: you can get it running in under a minute. One Docker Compose file and you're good to go:

services:
  scrape-dojo:
    image: ghcr.io/disane87/scrape-dojo:latest
    ports:
      - '8080:80'
    environment:
      - SCRAPE_DOJO_ENCRYPTION_KEY=your_key_here
      - DB_TYPE=sqlite
    volumes:
      - ./data:/home/pptruser/app/data
      - ./config:/home/pptruser/app/config
    restart: unless-stopped

Then just open http://localhost:8080; the UI and API run on the same port. Drop your scrape definitions as JSON files into the config folder and they automatically appear in the UI (hot reload!).
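Before dropping a file into the config folder, a quick sanity check can save you a round trip through the UI. The TypeScript sketch below only verifies the shape visible in the example workflow earlier in this post (a "scrapes" array, an "id", "steps" with "actions", each with an "action" string). Treating those fields as required is my assumption, checkScrapeFile is a hypothetical helper rather than part of Scrape Dojo, and it handles plain JSON only, not JSONC comments.

```typescript
type JsonObject = Record<string, unknown>;

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means the file
// matches the shape of the example workflow.
function checkScrapeFile(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["File is not valid JSON (JSONC comments are not handled here)"];
  }

  const scrapes = isObject(parsed) ? parsed.scrapes : undefined;
  if (!Array.isArray(scrapes)) {
    return ['Top level must be an object with a "scrapes" array'];
  }

  const problems: string[] = [];
  scrapes.forEach((scrape: unknown, i: number) => {
    if (!isObject(scrape)) {
      problems.push(`scrapes[${i}] must be an object`);
      return;
    }
    if (typeof scrape.id !== "string" || scrape.id === "") {
      problems.push(`scrapes[${i}] needs a non-empty string "id"`);
    }
    const steps = scrape.steps;
    if (!Array.isArray(steps)) {
      problems.push(`scrapes[${i}] needs a "steps" array`);
      return;
    }
    steps.forEach((step: unknown, j: number) => {
      const actions = isObject(step) ? step.actions : undefined;
      if (!Array.isArray(actions)) {
        problems.push(`scrapes[${i}].steps[${j}] needs an "actions" array`);
        return;
      }
      actions.forEach((action: unknown, k: number) => {
        if (!isObject(action) || typeof action.action !== "string") {
          problems.push(
            `scrapes[${i}].steps[${j}].actions[${k}] needs a string "action"`
          );
        }
      });
    });
  });
  return problems;
}

// Usage: a definition with a missing id and an action without a type.
const report = checkScrapeFile('{"scrapes":[{"steps":[{"actions":[{}]}]}]}');
console.log(report.length); // prints 2
```

For anything beyond this kind of shape check, the action reference in the official docs is the source of truth.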

docudigger as a Scrape Dojo Workflow 💡

Best part: everything docudigger could do, you can now replicate as a Scrape Dojo workflow. Log into Amazon, navigate through order history, download invoices, all declarative, all configurable, all extensible. And when Amazon changes their layout again, you just tweak the JSON file instead of debugging code.

Conclusion 🎯

Scrape Dojo is the logical evolution of docudigger for me. From a single Amazon scraper to a universal scraping platform. From hardcoded Puppeteer scripts to declarative workflows. From solo development to AI-aided development.

The project is open source (MIT license), fully self-hosted, and free for everyone. If you're interested, give it a try. I'd love to hear your feedback, see your issues, and of course collect your stars on GitHub! ⭐

Scrape Dojo — Documentation

The official Scrape Dojo documentation with quickstart guide, action reference, examples, and more.

scrape-dojo.com