#+OPTIONS: toc:nil

* Automated Browser

** Overview

The system consists of two Python scripts: =orchestrator.py= and =scrape.py=.
The entry point is =orchestrator.py=, while =scrape.py= runs inside the Docker
container. The Docker image is based on =selenium/standalone-chrome:latest= and
installs =undetected-chromedriver= to attempt to bypass bot detection.
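
For reference, here is a minimal sketch of what the scrape step inside the
container could look like with =undetected-chromedriver= (the function name,
default browser path, and flags are illustrative assumptions, not necessarily
the actual =scrape.py=):

#+begin_src python
# Minimal sketch of the in-container scrape step (illustrative only).
import undetected_chromedriver as uc

def scrape(url, browser_path="/usr/bin/google-chrome", extra_args=()):
    options = uc.ChromeOptions()
    options.binary_location = browser_path
    for arg in extra_args:
        options.add_argument(f"--{arg}")   # e.g. "headless", "no-sandbox"
    driver = uc.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source          # raw HTML, later written to output/output.txt
    finally:
        driver.quit()
#+end_src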

** Usage

#+begin_src bash
python orchestrator.py <website> [options]
#+end_src

*** Prerequisites

Make sure =docker= is installed. Following Python best practice, create a
virtual environment and install the dependencies:

#+begin_src sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
#+end_src

Then build the container:

#+begin_src sh
docker build -t search-api .
#+end_src

*** Arguments

- =website= (required): The website URL you want to scrape

*** Options

- =-b, --browser-path=: Path to browser binary (default: =/usr/bin/google-chrome=)
- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
- =-p, --proxy-url=: Proxy URL in format =http://user:pass@host:port=
- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)
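
The options above map naturally onto an =argparse= definition along these lines
(a sketch of how the flags could be declared, not necessarily the actual parser
in =orchestrator.py=):

#+begin_src python
# Sketch of a parser matching the documented CLI (assumption, not the real code).
import argparse

parser = argparse.ArgumentParser(description="Run the containerized scraper against a URL")
parser.add_argument("website", help="The website URL to scrape")
parser.add_argument("-b", "--browser-path", default="/usr/bin/google-chrome",
                    help="Path to the browser binary inside the container")
parser.add_argument("-a", "--browser-args", nargs="*", default=[],
                    help="Additional browser arguments (space-separated)")
parser.add_argument("-p", "--proxy-url", default=None,
                    help="Proxy URL in the form http://user:pass@host:port")
parser.add_argument("-i", "--image-name", default="search-api",
                    help="Name of the Docker image to use")
args = parser.parse_args()
#+end_src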

*** Examples

#+begin_src bash
# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image
#+end_src

*** Output

The script generates:
- Performance metrics (CPU, memory, network usage)
- Timing information (cold start, response time, total runtime)
- Scraped content in =output/output.txt=
- HTML report in =output/report.html= (automatically opened in browser if possible)
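
One way the per-container resource metrics could be sampled is by shelling out
to =docker stats= (an illustrative sketch; the actual collection mechanism may
differ):

#+begin_src python
# Sketch: sampling container resource usage via `docker stats` (assumed approach).
import subprocess

def sample_container_stats(container_name):
    fmt = "{{.CPUPerc}}|{{.MemUsage}}|{{.NetIO}}"
    out = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", fmt, container_name],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    cpu, mem, net = out.split("|")
    return {"cpu": cpu, "memory": mem, "network": net}
#+end_src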

* Design Doc

In this document I answer the questions asked in the email and describe how I
would go about scaling this to 10k concurrent users.

** Anti-bot Defenses

This will be an ongoing area of research and trial and error. We can start
with the core tactics of randomizing interaction delays and simulating mouse
movement, then gradually build out features that mimic human behavior.
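
As a rough illustration, randomized delays and mouse movement can be layered on
top of a Selenium-compatible driver with =ActionChains= (a sketch only; the
timing ranges and step counts are placeholder assumptions):

#+begin_src python
# Sketch: human-like pauses and mouse movement via Selenium ActionChains.
import random
from selenium.webdriver.common.action_chains import ActionChains

def humanized_click(driver, element):
    actions = ActionChains(driver)
    # Wander toward the element in a few small, randomly timed steps.
    for _ in range(random.randint(2, 5)):
        actions.move_by_offset(random.randint(-15, 15), random.randint(-15, 15))
        actions.pause(random.uniform(0.1, 0.6))
    actions.move_to_element(element)
    actions.pause(random.uniform(0.2, 1.0))
    actions.click()
    actions.perform()
#+end_src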

Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated or learned from for fingerprint
spoofing, and [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We are already using
=undetected-chromedriver=, which apparently helps with TLS shaping, but we can
explore other solutions.

** Crash Recovery

Something like Prometheus + Grafana would let us monitor resource usage, and
Alertmanager could notify stakeholders when something goes wrong. To recover
from a crash, we likely need session state in an in-memory store such as Redis;
combined with a dead-letter queue we can keep retrying with exponential
backoff, and the pooling/session management layer can respawn crashed nodes.
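
A minimal sketch of that idea, assuming Redis holds session checkpoints and a
plain list acts as the dead-letter queue (key names and job shape are
assumptions):

#+begin_src python
# Sketch: session checkpointing in Redis plus retry with exponential backoff.
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def checkpoint(session_id, state):
    # Persist enough state to resume the job if a browser node dies.
    r.set(f"session:{session_id}", json.dumps(state), ex=3600)

def retry_with_backoff(run_job, job, attempts=5):
    # run_job is the caller's scrape function; it should return True on success.
    for attempt in range(attempts):
        if run_job(job):
            return True
        time.sleep(2 ** attempt)             # 1s, 2s, 4s, ... between retries
    r.lpush("dead-letter", json.dumps(job))  # park the job for inspection / later retry
    return False
#+end_src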

** Session Pooling and Management

We would keep browser pods warm and expose custom endpoints to communicate with
the containers, managed as a pool by something like Nomad, with a message
broker plus Redis handling the queueing. We can incorporate several strategies
for when to recycle sessions: on a timer, when memory climbs above a threshold,
or after a certain number of requests has been processed.
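
A recycling policy like that could be expressed roughly as follows (the
thresholds are placeholder values):

#+begin_src python
# Sketch: deciding when to recycle a browser session (placeholder thresholds).
import time
from dataclasses import dataclass, field

@dataclass
class Session:
    started_at: float = field(default_factory=time.time)
    requests_served: int = 0
    memory_mb: float = 0.0

def should_recycle(session, max_age_s=1800, max_requests=200, max_memory_mb=1024):
    return (time.time() - session.started_at > max_age_s
            or session.requests_served > max_requests
            or session.memory_mb > max_memory_mb)
#+end_src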

We would also likely pool proxies and use network-utilization algorithms to
decide how to distribute requests across them.

** Scaling and Orchestration Model

Depending on load patterns, a cost-effective option may be bare-metal servers
sized for a base load of up to 5k users, with cloud infrastructure handling
only the bursts above that. Nomad can be used to manage and deploy to both
environments. We should also consider how many regions we need to cover.
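
As a back-of-the-envelope illustration of the burst model (all numbers are
placeholder assumptions, not measured capacity):

#+begin_src python
# Sketch: sizing the cloud burst above a fixed bare-metal base (assumed numbers).
BASE_CAPACITY = 5_000     # concurrent users handled by bare-metal
SESSIONS_PER_NODE = 25    # assumed concurrent browser sessions per cloud node

def burst_nodes_needed(concurrent_users):
    overflow = max(0, concurrent_users - BASE_CAPACITY)
    return -(-overflow // SESSIONS_PER_NODE)  # ceiling division

print(burst_nodes_needed(10_000))  # -> 200 cloud nodes under these assumptions
#+end_src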

** Unknowns

- The biggest unknown for me at the moment is how far we can push the browser
  engineering. In particular, between Chromium, WebKit, Gecko, and Servo, which
  one will be lightest on resources and quickest to start up, while also being
  the easiest to apply anti-bot techniques to and maintain long-term.

- I would need to investigate whether patching and compiling any of these
  browsers is an option to get the performance gains that would help us scale.

- This may already be known, but the traffic patterns we expect would dictate
  which regions we place clusters in, and also whether the bare-metal idea
  would actually reduce overall costs.

- Another unknown for me is session contamination: how many times can we reuse
  the same instance to scrape a website before it requires respawning.