#+OPTIONS: toc:nil

* SearchApi Automated Browser

** Overview

The system has two Python scripts: =orchestrator.py= and =scrape.py=. The entry
point into the system is =orchestrator=, while =scrape= runs inside the Docker
container. The Docker image is based on =selenium/standalone-chrome:latest= and
then installs =undetected-chromedriver= to attempt to bypass bot detection.
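
For orientation, here is a minimal sketch of what the orchestrator-to-container
hand-off could look like with the =docker= Python SDK (which is listed in
=requirements.txt=). The command, volume mount, and clean-up flow are
assumptions for illustration, not necessarily what =orchestrator.py= does.

#+begin_src python
# Hypothetical sketch: run the scraper image and collect its output.
# The image name matches the default from the options below; the command,
# mount point, and clean-up are assumed.
import docker

client = docker.from_env()
container = client.containers.run(
    "search-api",
    ["python", "scrape.py", "https://example.com"],
    volumes={"/tmp/output": {"bind": "/output", "mode": "rw"}},
    detach=True,
)
container.wait()                   # block until the scrape finishes
print(container.logs().decode())   # container stdout/stderr
container.remove()
#+end_src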

** Usage

#+begin_src bash
python orchestrator.py <website> [options]
#+end_src

*** Prerequisites

Make sure to have =docker= installed. For Python best practices, create a virtual
environment and install dependencies:

#+begin_src bash
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
#+end_src

*** Arguments

- =website= (required): The website URL you want to scrape

*** Options

- =-b, --browser-path=: Path to the browser binary (default: =/usr/bin/google-chrome=)
- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
- =-p, --proxy-url=: Proxy URL in the format =http://user:pass@host:port=
- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)

*** Examples

#+begin_src bash
# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image
#+end_src

*** Output

The script generates:
- Performance metrics (CPU, memory, network usage)
- Timing information (cold start, response time, total runtime)
- Scraped content in =output/output.txt=
- HTML report in =output/report.html= (automatically opened in browser if possible); see the sketch below for how such a report might be rendered
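
Since =jinja2= is in =requirements.txt=, the HTML report is presumably rendered
from a template. A minimal sketch of that idea follows; the template string and
the metric names are placeholders, not the real report format.

#+begin_src python
# Hypothetical sketch: render a tiny HTML report from collected metrics with jinja2.
# The template and metric names are placeholders, not the real report.
import os
from jinja2 import Template

metrics = {"cold_start_s": 2.3, "response_time_s": 1.1, "peak_memory_mb": 412}

template = Template("""
<html><body>
  <h1>Scrape report</h1>
  <ul>
  {% for name, value in metrics.items() %}
    <li>{{ name }}: {{ value }}</li>
  {% endfor %}
  </ul>
</body></html>
""")

os.makedirs("output", exist_ok=True)
with open("output/report.html", "w") as f:
    f.write(template.render(metrics=metrics))
#+end_src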

* Design Doc

In this document I will answer the questions asked in the email and describe how
I would go about scaling this to 10k concurrent users.

** Anti-bot Defenses
This would be an ongoing area of research and trial and error. We can implement
the core tactics of randomizing interaction delays and simulating mouse movement,
and gradually build out features that mimic human behavior.
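
As one concrete example of what those tactics could look like, here is a rough
sketch of a humanized click using Selenium's =ActionChains= with randomized
pauses and small mouse offsets. The target element, offsets, and timing ranges
are assumptions for illustration, not tuned values.

#+begin_src python
# Hypothetical sketch: jittered delays and incremental mouse movement before a click.
# The target selector and the timing ranges are illustrative only.
import random
import time

import undetected_chromedriver as uc
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = uc.Chrome()
driver.get("https://example.com")
time.sleep(random.uniform(1.0, 3.0))           # vary the page "reading" time

target = driver.find_element(By.TAG_NAME, "a")
actions = ActionChains(driver)
for _ in range(random.randint(3, 7)):          # approach the element in small steps
    actions.move_by_offset(random.randint(1, 15), random.randint(1, 15))
    actions.pause(random.uniform(0.05, 0.3))
actions.move_to_element(target).pause(random.uniform(0.2, 0.8)).click()
actions.perform()
driver.quit()
#+end_src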

Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated or learned from to provide
fingerprint spoofing, and [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We're already using
=undetected-chromedriver=, which reportedly handles TLS shaping, but we can
explore other solutions.

** Crash Recovery
Something like Prometheus + Grafana would allow us to monitor resource usage,
and something like Alertmanager could alert stakeholders when something is going
wrong. In the case of a crash, we likely need some sort of session state in an
in-memory DB like Redis, and with a Dead Letter Queue we can keep retrying with
exponential backoff. Then, with our pooling/session management, we can respawn
crashed nodes.
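
A rough sketch of that retry path is below. The queue names, job format, and
retry limit are all made up for illustration; the real worker would hand the job
to a browser session from the pool.

#+begin_src python
# Hypothetical sketch: pull jobs from a Redis queue, retry with exponential
# backoff, and park repeated failures on a dead letter queue.
# Queue names, job format, and limits are assumptions.
import json
import time

import redis

r = redis.Redis()
MAX_ATTEMPTS = 5

def process(job: dict) -> None:
    ...  # hand the URL to a browser session from the pool

while True:
    raw = r.blpop("scrape:jobs", timeout=5)
    if raw is None:
        continue
    job = json.loads(raw[1])
    attempt = job.get("attempt", 0)
    try:
        process(job)
    except Exception:
        if attempt + 1 >= MAX_ATTEMPTS:
            r.rpush("scrape:dead_letter", json.dumps(job))  # give up, keep for inspection
        else:
            time.sleep(2 ** attempt)                        # exponential backoff
            job["attempt"] = attempt + 1
            r.rpush("scrape:jobs", json.dumps(job))
#+end_src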

** Session Pooling and Management
We would keep browser pods up and running with something like Nomad and create
custom endpoints to communicate with the containers in the pool. Then a message
broker + Redis can handle queueing. We can incorporate several strategies for
when to recycle sessions: on a timer, when memory goes above a certain
threshold, or after a certain number of requests have been processed (a small
sketch of that decision follows below).
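
The thresholds and the =Session= fields below are placeholders, not measured
values; the point is only to show the three recycle triggers side by side.

#+begin_src python
# Hypothetical sketch: decide when to recycle a browser session.
# Thresholds and Session fields are placeholders, not measured values.
import time
from dataclasses import dataclass, field

MAX_AGE_S = 30 * 60      # recycle on a timer
MAX_MEMORY_MB = 1500     # recycle when memory climbs too high
MAX_REQUESTS = 200       # recycle after enough requests have been served

@dataclass
class Session:
    started_at: float = field(default_factory=time.time)
    memory_mb: float = 0.0
    requests_served: int = 0

def should_recycle(s: Session) -> bool:
    return (
        time.time() - s.started_at > MAX_AGE_S
        or s.memory_mb > MAX_MEMORY_MB
        or s.requests_served > MAX_REQUESTS
    )

print(should_recycle(Session(memory_mb=1800.0)))  # True: memory threshold exceeded
#+end_src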

We would also likely pool proxies and use network utilization algorithms to
determine how to distribute the requests.
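
One simple interpretation of that idea is to pick the least-utilized proxy from
the pool. The =Proxy= record and its fields are made up for illustration.

#+begin_src python
# Hypothetical sketch: pick the proxy with the lowest current utilization.
# The Proxy record and the capacity metric are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Proxy:
    url: str
    active_requests: int
    capacity: int  # requests it can comfortably carry

def pick_proxy(pool: list[Proxy]) -> Proxy:
    # Least-loaded first: lowest ratio of active requests to capacity.
    return min(pool, key=lambda p: p.active_requests / p.capacity)

pool = [
    Proxy("http://user:pass@proxy-a.example.com:8080", active_requests=12, capacity=50),
    Proxy("http://user:pass@proxy-b.example.com:8080", active_requests=3, capacity=20),
]
print(pick_proxy(pool).url)
#+end_src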

** Scaling and Orchestration Model
Depending on load patterns, perhaps a cost-effective solution here is to use
bare-metal servers that can handle a base load of up to 5k users, while the
cloud infrastructure would only handle bursts. Nomad can be used to manage and
deploy to both scenarios. We can also consider how many regions we would have to
cover.

** Unknowns
- The biggest unknown for me at the moment is just how far we can push the
  browser engineering. In particular, between Chromium, WebKit, Gecko, or Servo,
  which one will be the lightest on resources, quickest to start up, but also
  easiest to apply anti-bot techniques to and maintain long-term.

- I would need to investigate whether applying code patches and compiling any of
  these browsers is an option to give us the performance gains that would help
  us scale.

- This might be known already, but the sorts of traffic patterns we're going to
  have would dictate in which regions we place clusters and also whether the
  bare-metal idea would actually reduce overall costs.

- Another unknown for me is session contamination: how many times can we reuse
  the same instance to scrape a website before it requires respawning.

requirements.txt:

docker>=6.0.0
jinja2>=3.0.0