#+OPTIONS: toc:nil

* Automated Browser

** Overview

The system consists of two Python scripts: =orchestrator.py= and =scrape.py=.
The entry point is =orchestrator.py=, while =scrape.py= runs inside the Docker
container. The Docker image is based on =selenium/standalone-chrome:latest= and
installs =undetected-chromedriver= to attempt to bypass bot detection.
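
For reference, here is a minimal sketch of what the scrape step inside the
container could look like with =undetected-chromedriver= (the function name,
default browser path, and flags are illustrative assumptions, not necessarily
the actual =scrape.py=):

#+begin_src python
# Minimal sketch of the in-container scrape step (illustrative only).
import undetected_chromedriver as uc

def scrape(url, browser_path="/usr/bin/google-chrome", extra_args=()):
    options = uc.ChromeOptions()
    options.binary_location = browser_path
    for arg in extra_args:
        options.add_argument(f"--{arg}")   # e.g. "headless", "no-sandbox"
    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source          # raw HTML, later written to output/output.txt
    finally:
        driver.quit()
#+end_src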

** Usage

#+begin_src bash
python orchestrator.py <website> [options]
#+end_src

*** Prerequisites

Make sure =docker= is installed. Following Python best practice, create a
virtual environment and install the dependencies:

#+begin_src sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
#+end_src

Then build the container:

#+begin_src sh
docker build -t search-api .
#+end_src

*** Arguments

- =website= (required): The website URL you want to scrape

*** Options

- =-b, --browser-path=: Path to browser binary (default: =/usr/bin/google-chrome=)
- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
- =-p, --proxy-url=: Proxy URL in format =http://user:pass@host:port=
- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)
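
The options above map naturally onto an =argparse= definition along these lines
(a sketch of how the flags could be declared, not necessarily the actual parser
in =orchestrator.py=):

#+begin_src python
# Sketch of a parser matching the documented CLI (assumption, not the real code).
import argparse

parser = argparse.ArgumentParser(description="Run the containerized scraper against a URL")
parser.add_argument("website", help="The website URL to scrape")
parser.add_argument("-b", "--browser-path", default="/usr/bin/google-chrome",
                    help="Path to the browser binary inside the container")
parser.add_argument("-a", "--browser-args", nargs="*", default=[],
                    help="Additional browser arguments (space-separated)")
parser.add_argument("-p", "--proxy-url", default=None,
                    help="Proxy URL in the form http://user:pass@host:port")
parser.add_argument("-i", "--image-name", default="search-api",
                    help="Name of the Docker image to use")
args = parser.parse_args()
#+end_src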

*** Examples

#+begin_src bash
# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image
#+end_src

*** Output

The script generates:
- Performance metrics (CPU, memory, network usage)
- Timing information (cold start, response time, total runtime)
- Scraped content in =output/output.txt=
- HTML report in =output/report.html= (automatically opened in browser if possible)
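
One way the per-container resource metrics could be sampled is by shelling out
to =docker stats= (an illustrative sketch; the actual collection mechanism may
differ):

#+begin_src python
# Sketch: sampling container resource usage via `docker stats` (assumed approach).
import subprocess

def sample_container_stats(container_name):
    fmt = "{{.CPUPerc}}|{{.MemUsage}}|{{.NetIO}}"
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", fmt, container_name],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    cpu, mem, net = out.split("|")
    return {"cpu": cpu, "memory": mem, "network": net}
#+end_src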

* Design Doc

In this document I answer the questions asked in the email and describe how I
would go about scaling this to 10k concurrent users.

** Anti-bot Defenses

This will be an ongoing area of research and trial and error. We can start
with the core tactics of randomizing interaction delays and simulating mouse
movement, then gradually build out features that mimic human behavior.
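
As a rough illustration, randomized delays and mouse movement can be layered on
top of a Selenium-compatible driver with =ActionChains= (a sketch only; the
timing ranges and step counts are placeholder assumptions):

#+begin_src python
# Sketch: human-like pauses and mouse movement via Selenium ActionChains.
import random
from selenium.webdriver.common.action_chains import ActionChains

def humanized_click(driver, element):
    actions = ActionChains(driver)
    # Wander toward the element in a few small, randomly timed steps.
    for _ in range(random.randint(2, 5)):
        actions.move_by_offset(random.randint(-15, 15), random.randint(-15, 15))
        actions.pause(random.uniform(0.1, 0.6))
    actions.move_to_element(element)
    actions.pause(random.uniform(0.2, 1.0))
    actions.click()
    actions.perform()
#+end_src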

Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated or learned from for fingerprint
spoofing, and [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We are already using
=undetected-chromedriver=, which apparently helps with TLS shaping, but we can
explore other solutions.

** Crash Recovery

Something like Prometheus + Grafana would let us monitor resource usage, and
Alertmanager could notify stakeholders when something goes wrong. To recover
from a crash, we likely need session state in an in-memory store such as Redis;
combined with a dead-letter queue we can keep retrying with exponential
backoff, and the pooling/session management layer can respawn crashed nodes.
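
A minimal sketch of that idea, assuming Redis holds session checkpoints and a
plain list acts as the dead-letter queue (key names and job shape are
assumptions):

#+begin_src python
# Sketch: session checkpointing in Redis plus retry with exponential backoff.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def checkpoint(session_id, state):
    # Persist enough state to resume the job if a browser node dies.
    r.set(f"session:{session_id}", json.dumps(state), ex=3600)

def retry_with_backoff(run_job, job, attempts=5):
    # run_job is the caller's scrape function; it should return True on success.
    for attempt in range(attempts):
        if run_job(job):
            return True
        time.sleep(2 ** attempt)             # 1s, 2s, 4s, ... between retries
    r.lpush("dead-letter", json.dumps(job))  # park the job for inspection / later retry
    return False
#+end_src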

** Session Pooling and Management

We would keep browser pods warm and expose custom endpoints to communicate with
the containers, managed as a pool by something like Nomad, with a message
broker plus Redis handling the queueing. We can incorporate several strategies
for when to recycle sessions: on a timer, when memory climbs above a threshold,
or after a certain number of requests has been processed.
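
A recycling policy like that could be expressed roughly as follows (the
thresholds are placeholder values):

#+begin_src python
# Sketch: deciding when to recycle a browser session (placeholder thresholds).
import time
from dataclasses import dataclass, field

@dataclass
class Session:
    started_at: float = field(default_factory=time.time)
    requests_served: int = 0
    memory_mb: float = 0.0

def should_recycle(session, max_age_s=1800, max_requests=200, max_memory_mb=1024):
    return (time.time() - session.started_at > max_age_s
            or session.requests_served > max_requests
            or session.memory_mb > max_memory_mb)
#+end_src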

We would also likely pool proxies and use network-utilization algorithms to
decide how to distribute requests across them.

** Scaling and Orchestration Model

Depending on load patterns, a cost-effective option may be bare-metal servers
sized for a base load of up to 5k users, with cloud infrastructure handling
only the bursts above that. Nomad can be used to manage and deploy to both
environments. We should also consider how many regions we need to cover.
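
As a back-of-the-envelope illustration of the burst model (all numbers are
placeholder assumptions, not measured capacity):

#+begin_src python
# Sketch: sizing the cloud burst above a fixed bare-metal base (assumed numbers).
BASE_CAPACITY = 5_000     # concurrent users handled by bare-metal
SESSIONS_PER_NODE = 25    # assumed concurrent browser sessions per cloud node

def burst_nodes_needed(concurrent_users):
    overflow = max(0, concurrent_users - BASE_CAPACITY)
    return -(-overflow // SESSIONS_PER_NODE)  # ceiling division

print(burst_nodes_needed(10_000))  # -> 200 cloud nodes under these assumptions
#+end_src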

** Unknowns

- The biggest unknown for me at the moment is how far we can push the browser
  engineering. In particular, between Chromium, WebKit, Gecko, and Servo, which
  one will be lightest on resources and quickest to start up, while also being
  the easiest to apply anti-bot techniques to and maintain long-term.

- I would need to investigate whether patching and compiling any of these
  browsers is an option to get the performance gains that would help us scale.

- This may already be known, but the traffic patterns we expect would dictate
  which regions we place clusters in, and also whether the bare-metal idea
  would actually reduce overall costs.

- Another unknown for me is session contamination: how many times can we reuse
  the same instance to scrape a website before it requires respawning.