#+OPTIONS: toc:nil

* SearchApi Automated Browser

** Overview

The system has two Python scripts: =orchestrator.py= and =scrape.py=. The entry
point into the system is =orchestrator=, while =scrape= runs inside the Docker
container. The Docker image is based on =selenium/standalone-chrome:latest= and
then installs =undetected-chromedriver= to attempt to bypass bot detection.
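
For orientation, here is a minimal sketch of what the orchestrator-to-container
hand-off could look like with the =docker= Python SDK (which is listed in
=requirements.txt=). The command, volume mount, and clean-up flow are
assumptions for illustration, not necessarily what =orchestrator.py= does.

#+begin_src python
# Hypothetical sketch: run the scraper image and collect its output.
# The image name matches the default from the options below; the command,
# mount point, and clean-up are assumed.
import docker

client = docker.from_env()
container = client.containers.run(
    "search-api",
    ["python", "scrape.py", "https://example.com"],
    volumes={"/tmp/output": {"bind": "/output", "mode": "rw"}},
    detach=True,
)
container.wait()                   # block until the scrape finishes
print(container.logs().decode())   # container stdout/stderr
container.remove()
#+end_src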

** Usage

#+begin_src bash
python orchestrator.py <website> [options]
#+end_src

*** Prerequisites

Make sure to have =docker= installed. For Python best practices, create a virtual
environment and install dependencies:

#+begin_src bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
#+end_src

*** Arguments

- =website= (required): The website URL you want to scrape

*** Options

- =-b, --browser-path=: Path to the browser binary (default: =/usr/bin/google-chrome=)
- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
- =-p, --proxy-url=: Proxy URL in the format =http://user:pass@host:port=
- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)

*** Examples

#+begin_src bash
# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image
#+end_src

*** Output

The script generates:
- Performance metrics (CPU, memory, network usage)
- Timing information (cold start, response time, total runtime)
- Scraped content in =output/output.txt=
- HTML report in =output/report.html= (automatically opened in browser if possible); see the sketch below for how such a report might be rendered
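
Since =jinja2= is in =requirements.txt=, the HTML report is presumably rendered
from a template. A minimal sketch of that idea follows; the template string and
the metric names are placeholders, not the real report format.

#+begin_src python
# Hypothetical sketch: render a tiny HTML report from collected metrics with jinja2.
# The template and metric names are placeholders, not the real report.
import os
from jinja2 import Template

metrics = {"cold_start_s": 2.3, "response_time_s": 1.1, "peak_memory_mb": 412}

template = Template("""
<html><body>
  <h1>Scrape report</h1>
  <ul>
  {% for name, value in metrics.items() %}
    <li>{{ name }}: {{ value }}</li>
  {% endfor %}
  </ul>
</body></html>
""")

os.makedirs("output", exist_ok=True)
with open("output/report.html", "w") as f:
    f.write(template.render(metrics=metrics))
#+end_src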

* Design Doc

In this document I will answer the questions asked in the email and describe how
I would go about scaling this to 10k concurrent users.

** Anti-bot Defenses
This would be an ongoing area of research and trial and error. We can implement
the core tactics of randomizing interaction delays and simulating mouse movement,
and gradually build out features that mimic human behavior.
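
As one concrete example of what those tactics could look like, here is a rough
sketch of a humanized click using Selenium's =ActionChains= with randomized
pauses and small mouse offsets. The target element, offsets, and timing ranges
are assumptions for illustration, not tuned values.

#+begin_src python
# Hypothetical sketch: jittered delays and incremental mouse movement before a click.
# The target selector and the timing ranges are illustrative only.
import random
import time

import undetected_chromedriver as uc
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = uc.Chrome()
driver.get("https://example.com")
time.sleep(random.uniform(1.0, 3.0))           # vary the page "reading" time

target = driver.find_element(By.TAG_NAME, "a")
actions = ActionChains(driver)
for _ in range(random.randint(3, 7)):          # approach the element in small steps
    actions.move_by_offset(random.randint(1, 15), random.randint(1, 15))
    actions.pause(random.uniform(0.05, 0.3))
actions.move_to_element(target).pause(random.uniform(0.2, 0.8)).click()
actions.perform()
driver.quit()
#+end_src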

Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated or learned from to provide
fingerprint spoofing, and [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We're already using
=undetected-chromedriver=, which reportedly handles TLS shaping, but we can
explore other solutions.

** Crash Recovery
Something like Prometheus + Grafana would allow us to monitor resource usage,
and something like Alertmanager could alert stakeholders when something is going
wrong. In the case of a crash, we likely need some sort of session state in an
in-memory DB like Redis, and with a Dead Letter Queue we can keep retrying with
exponential backoff. Then, with our pooling/session management, we can respawn
crashed nodes.
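
A rough sketch of that retry path is below. The queue names, job format, and
retry limit are all made up for illustration; the real worker would hand the job
to a browser session from the pool.

#+begin_src python
# Hypothetical sketch: pull jobs from a Redis queue, retry with exponential
# backoff, and park repeated failures on a dead letter queue.
# Queue names, job format, and limits are assumptions.
import json
import time

import redis

r = redis.Redis()
MAX_ATTEMPTS = 5

def process(job: dict) -> None:
    ...  # hand the URL to a browser session from the pool

while True:
    raw = r.blpop("scrape:jobs", timeout=5)
    if raw is None:
        continue
    job = json.loads(raw[1])
    attempt = job.get("attempt", 0)
    try:
        process(job)
    except Exception:
        if attempt + 1 >= MAX_ATTEMPTS:
            r.rpush("scrape:dead_letter", json.dumps(job))  # give up, keep for inspection
        else:
            time.sleep(2 ** attempt)                        # exponential backoff
            job["attempt"] = attempt + 1
            r.rpush("scrape:jobs", json.dumps(job))
#+end_src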

** Session Pooling and Management
We would keep browser pods up and running with something like Nomad and create
custom endpoints to communicate with the containers in the pool. Then a message
broker + Redis can handle queueing. We can incorporate several strategies for
when to recycle sessions: on a timer, when memory goes above a certain
threshold, or after a certain number of requests have been processed (a small
sketch of that decision follows below).
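
The thresholds and the =Session= fields below are placeholders, not measured
values; the point is only to show the three recycle triggers side by side.

#+begin_src python
# Hypothetical sketch: decide when to recycle a browser session.
# Thresholds and Session fields are placeholders, not measured values.
import time
from dataclasses import dataclass, field

MAX_AGE_S = 30 * 60      # recycle on a timer
MAX_MEMORY_MB = 1500     # recycle when memory climbs too high
MAX_REQUESTS = 200       # recycle after enough requests have been served

@dataclass
class Session:
    started_at: float = field(default_factory=time.time)
    memory_mb: float = 0.0
    requests_served: int = 0

def should_recycle(s: Session) -> bool:
    return (
        time.time() - s.started_at > MAX_AGE_S
        or s.memory_mb > MAX_MEMORY_MB
        or s.requests_served > MAX_REQUESTS
    )

print(should_recycle(Session(memory_mb=1800.0)))  # True: memory threshold exceeded
#+end_src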

We would also likely pool proxies and use network utilization algorithms to
determine how to distribute the requests.
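
One simple interpretation of that idea is to pick the least-utilized proxy from
the pool. The =Proxy= record and its fields are made up for illustration.

#+begin_src python
# Hypothetical sketch: pick the proxy with the lowest current utilization.
# The Proxy record and the capacity metric are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Proxy:
    url: str
    active_requests: int
    capacity: int  # requests it can comfortably carry

def pick_proxy(pool: list[Proxy]) -> Proxy:
    # Least-loaded first: lowest ratio of active requests to capacity.
    return min(pool, key=lambda p: p.active_requests / p.capacity)

pool = [
    Proxy("http://user:pass@proxy-a.example.com:8080", active_requests=12, capacity=50),
    Proxy("http://user:pass@proxy-b.example.com:8080", active_requests=3, capacity=20),
]
print(pick_proxy(pool).url)
#+end_src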

** Scaling and Orchestration Model
Depending on load patterns, perhaps a cost-effective solution here is to use
bare-metal servers that can handle a base load of up to 5k users, while the
cloud infrastructure would only handle bursts. Nomad can be used to manage and
deploy to both scenarios. We can also consider how many regions we would have to
cover.

** Unknowns
- The biggest unknown for me at the moment is just how far we can push the
  browser engineering. In particular, between Chromium, WebKit, Gecko, or Servo,
  which one will be the lightest on resources, quickest to start up, but also
  easiest to apply anti-bot techniques to and maintain long-term.

- I would need to investigate whether applying code patches and compiling any of
  these browsers is an option to give us the performance gains that would help
  us scale.

- This might be known already, but the sorts of traffic patterns we're going to
  have would dictate in which regions we place clusters and also whether the
  bare-metal idea would actually reduce overall costs.

- Another unknown for me is session contamination: how many times can we reuse
  the same instance to scrape a website before it requires respawning.

requirements.txt:

docker>=6.0.0
jinja2>=3.0.0