diff --git a/README.org b/README.org
new file mode 100644
index 0000000..667b05c
--- /dev/null
+++ b/README.org
@@ -0,0 +1,120 @@
+#+OPTIONS: toc:nil
+
+* SearchApi Automated Browser
+
+** Overview
+
+The system has two Python scripts: =orchestrator.py= and =scrape.py=. The entry
+point into the system is =orchestrator=, while =scrape= runs inside the Docker
+container. The Docker image is based on =selenium/standalone-chrome:latest= and
+installs =undetected-chromedriver= to attempt to bypass bot detection.
+
+** Usage
+
+#+begin_src bash
+python orchestrator.py <website> [options]
+#+end_src
+
+*** Prerequisites
+
+Make sure =docker= is installed. Following Python best practice, create a
+virtual environment and install the dependencies:
+
+#+begin_src bash
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+#+end_src
+
+*** Arguments
+
+- =website= (required): The URL of the website you want to scrape
+
+*** Options
+
+- =-b, --browser-path=: Path to the browser binary (default: =/usr/bin/google-chrome=)
+- =-a, --browser-args=: Additional arguments to pass to the browser (space-separated)
+- =-p, --proxy-url=: Proxy URL in the format =http://user:pass@host:port=
+- =-i, --image-name=: Name of the Docker image to use (default: =search-api=)
+
+*** Examples
+
+#+begin_src bash
+# Basic usage
+python orchestrator.py https://example.com
+
+# With custom browser path and arguments
+python orchestrator.py https://example.com -b /usr/bin/chromium -a headless no-sandbox
+
+# Using a proxy
+python orchestrator.py https://example.com -p http://user:pass@proxy.example.com:8080
+
+# Custom Docker image
+python orchestrator.py https://example.com -i my-scraper-image
+#+end_src
+
+*** Output
+
+The script generates:
+- Performance metrics (CPU, memory, network usage)
+- Timing information (cold start, response time, total runtime)
+- Scraped content in =output/output.txt=
+- HTML report in =output/report.html= (opened automatically in a browser if possible)
+
+
+* Design Doc
+
+In this document I answer the questions asked in the email and describe how I
+would go about scaling this to 10k concurrent users.
+
+** Anti-bot Defenses
+This would be an ongoing area of research and trial and error. We can start
+with the core tactics of randomizing interaction delays and simulating mouse
+movement, then gradually build out features that mimic human behavior.
+
+Tools like [[https://github.com/daijro/camoufox][Camoufox]] can be incorporated, or learned from, to provide
+fingerprint spoofing. [[https://github.com/mitmproxy/mitmproxy][mitmproxy]] could handle TLS shaping. We're already using
+=undetected-chromedriver=, which reportedly handles TLS shaping as well, but we
+can explore other solutions.
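+
+As a rough sketch of the interaction-level tactics, the jittered delays and
+scrolling could start out as small helpers like the ones below (the function
+names, delay ranges, and scroll step are placeholders, not what =scrape.py=
+currently does):
+
+#+begin_src python
+import random
+import time
+
+def human_pause(min_s=0.8, max_s=2.5):
+    """Sleep for a randomized interval between browser actions."""
+    time.sleep(random.uniform(min_s, max_s))
+
+def human_scroll(driver, total_px=2000, step_px=200):
+    """Scroll the page in small, irregular increments instead of one jump."""
+    scrolled = 0
+    while scrolled < total_px:
+        step = random.randint(step_px // 2, step_px)
+        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
+        scrolled += step
+        human_pause(0.2, 0.9)
+
+# Example usage inside the scrape loop, where driver is the
+# undetected-chromedriver instance:
+#   driver.get(url)
+#   human_pause()
+#   human_scroll(driver)
+#+end_src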
+
+** Crash Recovery
+Something like Prometheus + Grafana would let us monitor resource usage, and
+something like Alertmanager could alert stakeholders when something is going
+wrong. To recover from a crash, we likely need some sort of session state in an
+in-memory DB like Redis, and with a dead-letter queue we can keep retrying
+failed jobs with exponential backoff. Then, with our pooling/session
+management, we can respawn crashed nodes.
+
+** Session Pooling and Management
+We would keep a pool of browser pods up and running, managed by something like
+Nomad, and create custom endpoints to communicate with the containers. A
+message broker plus Redis can then handle queueing. We can combine several
+strategies for deciding when to recycle sessions: on a timer, when memory usage
+goes above a threshold, or after a certain number of requests have been
+processed.
+
+We would also likely pool proxies and use network utilization algorithms to
+determine how to distribute requests across them.
+
+** Scaling and Orchestration Model
+Depending on load patterns, a cost-effective solution here may be to use
+bare-metal servers that can handle a base load of up to 5k users, with cloud
+infrastructure handling only the bursts. Nomad can be used to manage and deploy
+to both environments. We also need to consider how many regions we would have
+to cover.
+
+** Unknowns
+- The biggest unknown for me at the moment is just how far we can push the
+  browser engineering. In particular, among Chromium, WebKit, Gecko, and Servo,
+  which one will be the lightest on resources and the quickest to start up, but
+  also the easiest to apply anti-bot techniques to and maintain long-term.
+
+- I would need to investigate whether patching and compiling any of these
+  browsers is an option that would give us the performance gains needed to
+  scale.
+
+- This might already be known, but the traffic patterns we expect will dictate
+  in which regions we place clusters and also whether the bare-metal idea would
+  actually reduce overall costs.
+
+- Another unknown for me is session contamination: how many times we can reuse
+  the same instance to scrape a website before it needs to be respawned.
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000..fcdf0ec
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,2 @@
+docker>=6.0.0
+jinja2>=3.0.0
\ No newline at end of file