#+OPTIONS: toc:nil

* Automated Browser
** Overview
The system has two Python scripts: =orchestrator.py= and =scrape.py=. The entry point into the system is =orchestrator=, while =scrape= runs inside the Docker container. The Docker image is based on =selenium/standalone-chrome:latest= and installs =undetected-chromedriver= to attempt to bypass bot detection.

** Usage
#+begin_src bash
python orchestrator.py [options]
#+end_src

*** Prerequisites
Make sure to have =docker= installed.

Following Python best practices, create a virtual environment and install the dependencies:
#+begin_src sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
#+end_src

Also build the container:
#+begin_src sh
docker build -t search-api .
#+end_src

*** Arguments
- =website= (required): The website URL you want to scrape

*** Options
- =-b, --browser-path=: Path to browser binary (default: =/usr/bin/google-chrome=)
- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
- =-p, --proxy-url=: Proxy URL in format =http://user:pass@host:port=
- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)

*** Examples
#+begin_src bash
# Basic usage
python orchestrator.py https://example.com

# With custom browser path and arguments
python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox

# Using a proxy
python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080

# Custom Docker image
python orchestrator.py https://example.com -i my-scraper-image
#+end_src

*** Output
The script generates:
- Performance metrics (CPU, memory, network usage)
- Timing information (cold start, response time, total runtime)
- Scraped content in =output/output.txt=
- HTML report in =output/report.html= (automatically opened in a browser if possible)

* Design Doc
In this document I will answer the questions asked in the email and describe how I would go about scaling this to 10k concurrent users.

** Anti-bot Defenses
This would be an ongoing area of research and trial and error. We can implement the core tactics of randomizing interaction delays and simulating mouse movement, and gradually build out features that mimic human behavior. Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated or learned from for fingerprint spoofing, and [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We're already using =undetected-chromedriver=, which reportedly handles TLS shaping, but we can explore other solutions.

** Crash Recovery
Prometheus + Grafana would allow us to monitor resource usage, and something like Alertmanager could alert stakeholders when something goes wrong. In the case of a crash, we likely need some form of session state in an in-memory DB like Redis; combined with a Dead Letter Queue, we can keep retrying with exponential backoff. Then, with our pooling/session management, we can respawn crashed nodes.

** Session Pooling and Management
Keep browser pods up and running, with custom endpoints to communicate with the containers, managed as a pool by something like Nomad; a message broker plus Redis can handle queueing. We can combine several strategies for deciding when to recycle a session: on a timer, when memory use crosses a threshold, or after a certain number of requests have been processed (see the sketch below). We would also likely pool proxies and use network-utilization algorithms to decide how to distribute requests across them.
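To make the recycling strategies above concrete, here is a minimal sketch of how a pool manager might combine them. Everything in it is an assumption for illustration: the =BrowserSession= wrapper, the threshold values, and the use of =psutil= for memory sampling are not part of the current code.

#+begin_src python
import time
import psutil  # assumed dependency for per-process memory sampling

# Illustrative thresholds; real values would come from load testing.
MAX_AGE_SECONDS = 15 * 60        # recycle on a timer
MAX_RSS_BYTES = 1_500_000_000    # recycle when memory crosses a threshold
MAX_REQUESTS = 200               # recycle after N processed requests


class BrowserSession:
    """Hypothetical wrapper around one containerized browser instance."""

    def __init__(self, pid: int):
        self.pid = pid
        self.started_at = time.monotonic()
        self.requests_served = 0

    def rss_bytes(self) -> int:
        # Resident memory of the browser process backing this session.
        return psutil.Process(self.pid).memory_info().rss

    def should_recycle(self) -> bool:
        """Combine the three strategies: age, memory pressure, request count."""
        too_old = time.monotonic() - self.started_at > MAX_AGE_SECONDS
        too_big = self.rss_bytes() > MAX_RSS_BYTES
        too_used = self.requests_served >= MAX_REQUESTS
        return too_old or too_big or too_used
#+end_src

A pool manager would call =should_recycle()= after each request and, when it returns true, respawn the container (for example via the Nomad API) before handing the slot back to the queue.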
** Scaling and Orchestration Model
Depending on load patterns, a cost-effective solution may be to use bare-metal servers that can handle a base load of up to 5k users, with cloud infrastructure handling only the bursts. Nomad can be used to manage and deploy to both environments. We should also consider how many regions we would have to cover.

** Unknowns
- The biggest unknown for me at the moment is just how far we can push the browser engineering. In particular, between Chromium, WebKit, Gecko, and Servo, which of these will be the lightest on resources and quickest to start up, while also being the easiest to apply anti-bot techniques to and maintain long-term.
- I would need to investigate whether applying code patches and compiling any of these browsers is an option to give us the performance gains that would help us scale.
- This might already be known, but the traffic patterns we expect would dictate which regions we place clusters in, and also whether the bare-metal idea would actually reduce overall costs.
- Another unknown for me is session contamination: how many times can we reuse the same instance to scrape a website before it requires respawning? (A small measurement sketch follows this list.)
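One way to start narrowing down the session-contamination unknown empirically is to reuse a single instance against a target and record how many requests succeed before block pages or CAPTCHAs appear. The sketch below assumes a hypothetical =scrape_once()= callable that performs one scrape with the existing session and reports whether the response was clean; it is a measurement aid, not part of the current system.

#+begin_src python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReuseResult:
    attempts: int                   # how many requests were made in total
    first_block_at: Optional[int]   # request index where blocking started, if any


def measure_reuse(scrape_once: Callable[[], bool], max_attempts: int = 500) -> ReuseResult:
    """Reuse one browser session repeatedly and note when the target starts blocking.

    `scrape_once` is a hypothetical helper returning True for a clean response
    and False when the site served a block page or CAPTCHA.
    """
    attempts = 0
    first_block_at = None
    for i in range(1, max_attempts + 1):
        attempts = i
        if not scrape_once():
            first_block_at = i
            break
    return ReuseResult(attempts=attempts, first_block_at=first_block_at)
#+end_src

Running this per target site, and per recycling strategy, would give us empirical data for where to set the reuse cap in the pool manager rather than guessing.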