Automated Browser

Overview

The system consists of two Python scripts: orchestrator.py and scrape.py. The entry point is orchestrator.py, while scrape.py runs inside the Docker container. The Docker image is based on selenium/standalone-chrome:latest and installs undetected-chromedriver to attempt to bypass bot detection.
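
A minimal sketch of what scrape.py might look like inside the container, assuming the URL is passed as the first CLI argument and the page source is written to output/output.txt (the actual script may differ):

# Hypothetical sketch of scrape.py; URL handling and output path are assumptions.
import sys
import undetected_chromedriver as uc

def main():
    url = sys.argv[1]
    options = uc.ChromeOptions()
    options.add_argument("--headless=new")   # no display inside the container
    options.add_argument("--no-sandbox")     # commonly needed when running as root in Docker
    driver = uc.Chrome(options=options, browser_executable_path="/usr/bin/google-chrome")
    try:
        driver.get(url)
        with open("output/output.txt", "w", encoding="utf-8") as f:
            f.write(driver.page_source)
    finally:
        driver.quit()

if __name__ == "__main__":
    main()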

Usage

python orchestrator.py <website> [options]

Prerequisites

Make sure Docker is installed. For Python best practices, create a virtual environment and install the dependencies:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Also build the Docker image:

docker build -t search-api .

Arguments

  • website (required): The website URL you want to scrape

Options

  • -b, --browser-path: Path to browser binary (default: /usr/bin/google-chrome)
  • -a, --browser-args: Additional arguments to pass to the browser (space-separated)
  • -p, --proxy-url: Proxy URL in format http://user:pass@host:port
  • -i, --image-name: Name of the Docker image to use (default: search-api)
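
A rough sketch of how these options might be wired up in orchestrator.py with argparse (flag names match the list above; the exact implementation is an assumption):

# Hypothetical sketch of the CLI parsing in orchestrator.py.
import argparse

parser = argparse.ArgumentParser(description="Run the containerized scraper against a website.")
parser.add_argument("website", help="The website URL to scrape")
parser.add_argument("-b", "--browser-path", default="/usr/bin/google-chrome",
                    help="Path to the browser binary")
parser.add_argument("-a", "--browser-args", nargs="*", default=[],
                    help="Additional browser arguments (space-separated)")
parser.add_argument("-p", "--proxy-url", help="Proxy URL, e.g. http://user:pass@host:port")
parser.add_argument("-i", "--image-name", default="search-api",
                    help="Docker image to use")
args = parser.parse_args()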

Examples

# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image

Output

The script generates:

  • Performance metrics (CPU, memory, network usage)
  • Timing information (cold start, response time, total runtime)
  • Scraped content in output/output.txt
  • HTML report in output/report.html (automatically opened in browser if possible)
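
The container metrics could be sampled with the Docker SDK for Python roughly like this (the stat fields come from Docker's stats API; the container name and how orchestrator.py actually gathers them are assumptions):

# Hypothetical sketch: one-shot CPU/memory/network snapshot from a running container.
import docker

client = docker.from_env()
container = client.containers.get("scraper")   # assumed container name
stats = container.stats(stream=False)

mem_usage = stats["memory_stats"]["usage"]
cpu_total = stats["cpu_stats"]["cpu_usage"]["total_usage"]
net = stats["networks"]["eth0"]
print(f"memory: {mem_usage} bytes, cpu: {cpu_total} ns, "
      f"rx: {net['rx_bytes']} bytes, tx: {net['tx_bytes']} bytes")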

Design Doc

In this document I will answer the questions asked in the email and describe how I would go about scaling this to 10k concurrent users.

Anti-bot Defenses

This would be an ongoing area of research and trial and error. We can implement the core tactics of randomizing interaction delays and simulating mouse movement, then gradually build out features that mimic human behavior.
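
As a starting point, a sketch of the randomized delays and mouse movement using Selenium's ActionChains (the offsets and timing ranges are arbitrary placeholders):

# Hypothetical sketch: jittered delays and small mouse movements between actions.
import random
import time
from selenium.webdriver.common.action_chains import ActionChains

def human_pause(min_s=0.5, max_s=2.5):
    # Sleep for a random, human-looking interval between actions.
    time.sleep(random.uniform(min_s, max_s))

def wander_mouse(driver, steps=5):
    # Nudge the cursor around in small random offsets with short pauses.
    actions = ActionChains(driver)
    for _ in range(steps):
        actions.move_by_offset(random.randint(-40, 40), random.randint(-40, 40))
        actions.pause(random.uniform(0.05, 0.3))
    actions.perform()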

Tools like Camoufox can be incorporated, or learned from, to provide fingerprint spoofing, and mitmproxy could handle TLS shaping. We're already using undetected-chromedriver, which reportedly handles TLS shaping, but we can explore other solutions.

Crash Recovery

Something like Prometheus + Grafana would allow us to monitor resource usage, and Alertmanager could alert stakeholders when something goes wrong. In the case of a crash, we would likely need session state in an in-memory store like Redis; combined with a dead letter queue, we can keep retrying with exponential backoff. Our pooling/session management layer can then respawn crashed nodes.
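
A sketch of the retry-with-backoff / dead-letter flow, assuming jobs are small JSON blobs and Redis lists are used as queues (the key name and attempt limit are placeholders):

# Hypothetical sketch: retry a scrape job with exponential backoff, then dead-letter it.
import json
import time
import redis

r = redis.Redis()
MAX_ATTEMPTS = 5

def process_with_retry(job, scrape_fn):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return scrape_fn(job["url"])
        except Exception:
            time.sleep(2 ** attempt)   # 1s, 2s, 4s, 8s, 16s
    # All retries exhausted: park the job in a dead letter queue for inspection.
    r.lpush("jobs:dead-letter", json.dumps(job))
    return None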

Session Pooling and Management

Keep browser pods up and running with something like Nomad managing a pool, and create custom endpoints to communicate with the containers. A message broker plus Redis can then handle queueing. We can incorporate several strategies for when to recycle sessions: on a timer, when memory goes above a certain threshold, or after a certain number of requests have been processed.
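
The recycle policy could start as a simple predicate checked after each request (thresholds here are illustrative):

# Hypothetical sketch: decide whether a browser session should be recycled.
import time

MAX_AGE_S = 15 * 60            # recycle after 15 minutes
MAX_RSS_BYTES = 1_500_000_000  # recycle above ~1.5 GB of memory
MAX_REQUESTS = 100             # recycle after 100 requests

def should_recycle(session):
    return (
        time.time() - session["started_at"] > MAX_AGE_S
        or session["rss_bytes"] > MAX_RSS_BYTES
        or session["requests_served"] >= MAX_REQUESTS
    )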

We would also likely pool proxies and use network-utilization-aware algorithms to determine how to distribute requests.
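
Proxy selection could start as simply as picking the least-loaded proxy from the pool (a sketch; the load metric tracked per proxy is an assumption):

# Hypothetical sketch: pick the proxy currently carrying the fewest active requests.
def pick_proxy(proxies):
    # proxies: list of dicts like {"url": "http://user:pass@host:port", "active_requests": 3}
    return min(proxies, key=lambda p: p["active_requests"])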

Scaling and Orchestration Model

Depending on load patterns, a cost-effective solution here might be to use bare-metal servers that can handle a base load of up to 5k users, with cloud infrastructure handling only the bursts. Nomad can be used to manage and deploy to both environments. We would also need to consider how many regions to cover.
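
The split between bare-metal base load and cloud burst capacity could be driven by a simple threshold in the scaling logic (numbers are placeholders):

# Hypothetical sketch: fill the bare-metal tier first, overflow to the cloud tier.
BARE_METAL_CAPACITY = 5_000   # concurrent sessions the bare-metal tier can hold

def plan_capacity(concurrent_sessions):
    bare_metal = min(concurrent_sessions, BARE_METAL_CAPACITY)
    cloud_burst = max(0, concurrent_sessions - BARE_METAL_CAPACITY)
    return {"bare_metal": bare_metal, "cloud_burst": cloud_burst}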

Unknowns

  • The biggest unknown for me at the moment is just how far we can push the browser engineering. In particular, among Chromium, WebKit, Gecko, and Servo, which will be the lightest on resources and quickest to start up, while also being the easiest to apply anti-bot techniques to and maintain long-term.
  • I would need to investigate if applying code patches and compiling any of these browsers is an option to give us the performance gains that would help us scale.
  • This may already be known, but the traffic patterns we expect would dictate which regions we place clusters in and whether the bare-metal idea would actually reduce overall costs.
  • Another unknown for me is session contamination: how many times can we reuse the same instance to scrape a website before it needs to be respawned?