* Task
Build a Docker image that boots a minimal browser (Chromium, Firefox, Safari, or Edge all work). Then write a small script that uses the image to scrape the following URL:
https://www.google.com/search?q=MINISFORUM+MS-A2
Requirements:
- Accept optional proxy URL and optional browser launch flags
- Estimate and report:
  - Cold start time
  - Total transfer size (bandwidth over the wire)
  - Time to response
  - CPU and memory usage
- Save final HTML output to a file
- Use any language you're comfortable with
- We can provide a proxy URL, or you can use your own
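One way to structure the script around these requirements is sketched below, in Python and using only the standard library. It is a minimal sketch, not a reference solution: the option names (`--proxy`, `--browser-flag`, `--out`), the `RunReport` fields, and the `launch`/`fetch`/`close` callables are all hypothetical placeholders for whatever browser driver is chosen. It assumes `fetch` returns the final HTML together with a wire-byte count that the driver collects itself, and that the browser runs as a child process that has exited and been reaped by the time `close` returns (otherwise `RUSAGE_CHILDREN` will not include it).

```python
import argparse
import resource  # Unix-only; acceptable inside a Linux container
import time
from dataclasses import dataclass
from pathlib import Path

DEFAULT_URL = "https://www.google.com/search?q=MINISFORUM+MS-A2"


@dataclass
class RunReport:
    cold_start_s: float        # time for launch() to return a usable browser
    time_to_response_s: float  # time for fetch() to return the page
    transfer_bytes: int        # wire bytes, as reported by fetch()
    cpu_s: float               # user + system CPU of reaped child processes
    max_rss: int               # peak RSS of the largest reaped child (KiB on Linux)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Scrape one URL with a containerised browser.")
    parser.add_argument("--url", default=DEFAULT_URL)
    parser.add_argument("--proxy", default=None, help="optional proxy URL")
    # Repeatable. Pass as --browser-flag=--disable-gpu so argparse keeps the leading dashes.
    parser.add_argument("--browser-flag", action="append", default=[])
    parser.add_argument("--out", default="output.html")
    return parser.parse_args(argv)


def measure(launch, fetch, close):
    """Time launch() and fetch(), then read child-process resource usage."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    t0 = time.perf_counter()
    browser = launch()
    t1 = time.perf_counter()
    html, transfer_bytes = fetch(browser)
    t2 = time.perf_counter()
    close(browser)  # children must exit and be waited on before rusage counts them
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_s = (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)
    return html, RunReport(t1 - t0, t2 - t1, transfer_bytes, cpu_s, after.ru_maxrss)


def save_html(html, path):
    Path(path).write_text(html, encoding="utf-8")
```

Keeping the browser behind three plain callables means the timing and reporting logic can be exercised with stubs before any browser is wired in. If the browser is not a direct child of the script (for example, it runs in a separate container), CPU and memory would have to come from another source, such as the container runtime's own statistics.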
* Goal:
Optimize for:
- Low latency
- Minimal bandwidth
- High success rate (avoid bans, captchas, etc.)
Then:
Write a short design doc (max 4 pages) outlining how you'd scale this to 10k concurrent requests. No need to detail measurement tooling; just focus on the next steps to evolve this into a full browser farm. Include:
- Fingerprinting and TLS shaping
- Crash recovery
- Session pooling and management
- Scaling and orchestration model
- Anti-bot defenses
- Unknowns and how you'd tackle them
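To make the session pooling and crash recovery items concrete, here is a minimal Python sketch of a fixed-size pool that recycles sessions and replaces any that crash or fail a health check. Everything in it is an assumption for illustration: `SessionPool`, the `factory`/`close`/`is_healthy` callables, and the `max_uses` recycling limit are hypothetical names, and a real farm would likely distribute this across hosts rather than keep it in one process.

```python
import queue
from contextlib import contextmanager


class SessionPool:
    """Fixed-size pool; unhealthy, worn-out, or crashed sessions are replaced."""

    def __init__(self, factory, close, is_healthy, size, max_uses=50):
        self._factory = factory        # creates a new browser session
        self._close = close            # tears a session down
        self._is_healthy = is_healthy  # cheap liveness probe
        self._max_uses = max_uses      # recycle after this many checkouts
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put((factory(), 0))

    def _replace(self, session):
        try:
            self._close(session)
        except Exception:
            pass  # the session may already be dead
        return self._factory()

    @contextmanager
    def acquire(self, timeout=None):
        # Blocks until a session is idle; raises queue.Empty on timeout.
        session, uses = self._idle.get(timeout=timeout)
        if uses >= self._max_uses or not self._is_healthy(session):
            session, uses = self._replace(session), 0
        try:
            yield session
            uses += 1
        except Exception:
            # Treat any failure as a possible crash: discard and start fresh.
            session, uses = self._replace(session), 0
            raise
        finally:
            self._idle.put((session, uses))
```

A caller would wrap each scrape in `with pool.acquire() as session:` so that a failed request never returns a possibly corrupted session to the pool. Treating every exception as a crash is deliberately conservative; a fuller design would distinguish target-side errors (bans, captchas) from browser failures.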
We want to see how you'd approach this independently and steer the project forward. You don't need to know everything, but the plan should be grounded and reasonable.
Time cap: 12 days max. Let us know if that sounds fair or if you'd prefer to tweak anything. We're flexible, just aiming for something valuable and time-bounded.