Scrape my website with zenika/alpine-chrome

Joseph Ferano 2025-07-30 11:44:43 +07:00
commit 875b8c35ac
4 changed files with 169 additions and 0 deletions

Dockerfile (new file, +7 lines)

@@ -0,0 +1,7 @@
FROM zenika/alpine-chrome:latest
# Expose port 3000 for remote debugging
EXPOSE 3000
# Override the default command to use port 3000
CMD ["chromium-browser", "--headless", "--no-sandbox", "--disable-gpu", "--remote-debugging-port=3000", "--remote-debugging-address=0.0.0.0"]
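# Example usage (the image tag below is only an assumption for illustration):
#   docker build -t alpine-chrome-3000 .
#   docker run --rm -p 3000:3000 alpine-chrome-3000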

Log.org (new file, +25 lines)

@@ -0,0 +1,25 @@
* Setting up
The following, from https://hub.docker.com/r/browserless/chrome, made me think
twice about setting Chrome up in Docker myself, so I just grabbed that image for now:
#+begin_quote
Getting Chrome running well in docker is also a challenge as there's quite a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.
#+end_quote
- I realized soon enough that ws://localhost:3000 is browserless' own API, so I
  went looking for how to get the websocket URL for the Chrome DevTools; it turns
  out I need to launch an instance first (see the sketch at the end of this entry).
Browserless does have an API, but after going through the documentation I felt
that using it would probably defeat the purpose of the exercise, so I used this
image instead:
https://hub.docker.com/r/zenika/alpine-chrome
Perhaps the exercise expects me to actually build an image from scratch, but
let's make progress on all the other tasks before tackling that.
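For reference, a minimal sketch of what "launch an instance first" looks like against
Chrome's DevTools HTTP endpoint. It assumes port 3000 is published to the host, and
that the Chrome version in the image expects =PUT= rather than =GET= for =/json/new=
(newer versions do).
#+begin_src python
import requests

# Open a new blank target (tab), then list targets to read off its websocket URL.
requests.put("http://localhost:3000/json/new?about:blank")
targets = requests.get("http://localhost:3000/json").json()
print(targets[0]["webSocketDebuggerUrl"])
#+end_src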

Task.org (new file, +43 lines)

@@ -0,0 +1,43 @@
* Task
Build a Docker image that boots a minimal browser (Chromium, Firefox, Safari, or Edge all work). Then write a small script that uses the image to scrape the following URL:
https://www.google.com/search?q=MINISFORUM+MS-A2
Requirements:
- Accept optional proxy URL and optional browser launch flags
- Estimate and report:
  - Cold start time
  - Total transfer size (bandwidth over the wire)
  - Time to response
  - CPU and memory usage
- Save final HTML output to a file
- Use any language you're comfortable with
- We can provide a proxy URL, or you can use your own
* Goal:
Optimize for:
- Low latency
- Minimal bandwidth
- High success rate (avoid bans, captchas, etc.)
Then:
Write a short design doc (max 4 pages) outlining how you'd scale this to 10k concurrent requests. No need to detail measurement tooling; just focus on the next steps to evolve this into a full browser farm. Include:
- Fingerprinting and TLS shaping
- Crash recovery
- Session pooling and management
- Scaling and orchestration model
- Anti-bot defenses
- Unknowns and how you'd tackle them
We want to see how you'd approach this independently and steer the project forward. You don't need to know everything, but the plan should be grounded and reasonable.
Time cap: 12 days max. Let us know if that sounds fair or if you'd prefer to tweak anything. We're flexible, just aiming for something valuable and time-bounded.

scrape.py (new file, +94 lines)

@@ -0,0 +1,94 @@
import requests
import websockets
import json
import asyncio


async def scrape():
    # Simple incrementing id for Chrome DevTools Protocol (CDP) messages.
    id_count = [0]

    def get_id():
        id_count[0] += 1
        return id_count[0]

    # Ask the DevTools HTTP endpoint for the list of open targets (tabs).
    response = requests.get("http://localhost:3000/json")
    targets = response.json()
    if not targets:
        print("No active sessions found")
        return

    websocket_url = targets[0]['webSocketDebuggerUrl']
    print(f"Connecting to: {websocket_url}")

    async with websockets.connect(websocket_url) as ws:
        # Enable the CDP domains we need; assume the next message is the ack.
        for elem in ["DOM", "Page"]:
            print("Enabling", elem)
            await ws.send(json.dumps({
                "id": get_id(),
                "method": f"{elem}.enable"
            }))
            # await asyncio.sleep(1)
            response = await ws.recv()
            print(f"{elem} enabled:", json.loads(response))

        print("Starting up")
        await ws.send(json.dumps({
            "id": get_id(),
            "method": "Page.navigate",
            # "params": {"url": "https://www.google.com/search?q=MINISFORUM+MS-A2"}
            "params": {"url": "https://ferano.io"}
        }))
        print("Sent navigate request")

        # Wait until the page fires its load event before touching the DOM.
        while True:
            response = await ws.recv()
            data = json.loads(response)
            if data.get("method") == "Page.loadEventFired":
                break
        print("Got loadEventFired event")

        print("Get Document...")
        await ws.send(json.dumps({
            "id": get_id(),
            "method": "DOM.getDocument"
        }))
        print("Woot")
        document_id = id_count[0]  # Store the ID we just used

        while True:
            response = await ws.recv()
            data = json.loads(response)
            # Check if this is the response to our DOM.getDocument request
            if data.get("id") == document_id:
                root_node_id = data['result']['root']['nodeId']
                # Ask for the serialized HTML of the root node.
                await ws.send(json.dumps({
                    "id": get_id(),
                    "method": "DOM.getOuterHTML",
                    "params": {"nodeId": root_node_id}
                }))
                html_id = id_count[0]
                # Drain events until the getOuterHTML response arrives.
                while True:
                    response = await ws.recv()
                    data = json.loads(response)
                    if data.get("id") == html_id and "result" in data:
                        html_content = data['result']['outerHTML']
                        print(html_content)
                        break
                    else:
                        print("Received event:", data)
                print("Something happened")
                break

        # response = await ws.recv()
        # root_data = json.loads(response)
        # root_node_id = root_data["result"]["root"]["nodeId"]
        # print(root_data)


asyncio.run(scrape())
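The task above asks for the final HTML to be saved to a file, which the script does not do yet; it only prints the markup. A minimal sketch of that step, as a hypothetical helper (the default path "output.html" is an assumption, not part of the current script), which could replace the print(html_content) call:

    from pathlib import Path

    def save_html(html_content: str, path: str = "output.html") -> None:
        # Hypothetical helper: persist the scraped markup instead of printing it.
        Path(path).write_text(html_content, encoding="utf-8")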