Scrape my website with zenika/alpine-chrome
commit 875b8c35ac
Dockerfile (new file, 7 lines)
@@ -0,0 +1,7 @@
FROM zenika/alpine-chrome:latest

# Expose port 3000 for remote debugging
EXPOSE 3000

# Override the default command to use port 3000
CMD ["chromium-browser", "--headless", "--no-sandbox", "--disable-gpu", "--remote-debugging-port=3000", "--remote-debugging-address=0.0.0.0"]
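A quick smoke test for the image above (my sketch, not part of the commit): assuming it was built and started with something like docker build -t alpine-chrome-scraper . followed by docker run --rm -p 3000:3000 alpine-chrome-scraper (the image tag and port mapping are my assumptions), the Chrome DevTools HTTP API should answer on /json/version:

    # Hypothetical smoke test; the image tag and port mapping above are assumptions.
    import requests

    info = requests.get("http://localhost:3000/json/version", timeout=5).json()
    print(info.get("Browser"))               # e.g. "HeadlessChrome/..."
    print(info.get("webSocketDebuggerUrl"))  # browser-level CDP websocket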
Log.org (new file, 25 lines)
@@ -0,0 +1,25 @@
* Setting up

After reading the following on https://hub.docker.com/r/browserless/chrome

#+begin_quote
Getting Chrome running well in docker is also a challenge as there's quite a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.
#+end_quote

This made me think twice about setting it up myself, so I just grabbed this image for now.

- I realized soon enough that ws://localhost:3000 is browserless' own API, so I went
  and tried to figure out how to go about getting the websocket for the chrome
  devtools; it turns out I need to launch an instance first.
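For reference, here is a sketch (mine, not from the log) of what "launch an instance first" can look like against a plain Chromium DevTools endpoint rather than the browserless API: /json/new opens a fresh target and returns its webSocketDebuggerUrl. Recent Chromium expects PUT for this route; older builds accepted GET.

#+begin_src python
# Sketch only: assumes a Chromium DevTools HTTP endpoint listening on port 3000.
import requests

def open_target(base="http://localhost:3000", url="about:blank"):
    # Recent Chromium expects PUT for /json/new; older builds used GET.
    target = requests.put(f"{base}/json/new?{url}").json()
    return target["webSocketDebuggerUrl"]
#+end_src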

Browserless has an API, but I went through the documentation and quickly felt
like using it probably defeats the purpose of the exercise, so I instead
used this:

https://hub.docker.com/r/zenika/alpine-chrome

Perhaps the exercise is looking for me to actually build an image from scratch,
but let's make progress on all the other tasks before tackling that.
Task.org (new file, 43 lines)
@@ -0,0 +1,43 @@
* Task

Build a Docker image that boots a minimal browser (Chromium, Firefox, Safari, or Edge all work). Then write a small script that uses the image to scrape the following URL:

https://www.google.com/search?q=MINISFORUM+MS-A2

Requirements:

- Accept optional proxy URL and optional browser launch flags (see the sketch after this list)

* Estimate and report:

- Cold start time
- Total transfer size (bandwidth over the wire)
- Time to response
- CPU and memory usage

- Save final HTML output to a file
- Use any language you're comfortable with
- We can provide a proxy URL, or you can use your own
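
A rough sketch (editor's illustration; option names like --proxy and --flag are not specified by the task) of how the scraping script could accept the optional proxy URL and extra launch flags and turn them into Chromium arguments:

#+begin_src python
# Illustrative only: the argument names below are assumptions, not part of the task.
import argparse

def parse_args():
    p = argparse.ArgumentParser(description="Scrape a URL through headless Chromium")
    p.add_argument("--proxy", default=None,
                   help="optional proxy URL, forwarded as --proxy-server=<url>")
    p.add_argument("--flag", action="append", default=[],
                   help="extra Chromium launch flag; repeatable")
    return p.parse_args()

def chromium_args(opts):
    extra = list(opts.flag)
    if opts.proxy:
        extra.append(f"--proxy-server={opts.proxy}")
    return extra  # appended to the chromium-browser command line in the container

if __name__ == "__main__":
    print(chromium_args(parse_args()))
#+end_src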

* Goal:

Optimize for:

- Low latency
- Minimal bandwidth
- High success rate (avoid bans, captchas, etc.)

Then:

Write a short design doc (max 4 pages) outlining how you'd scale this to 10k concurrent requests. No need to detail measurement tooling; just focus on next steps to evolve this into a full browser farm. Include:

- Fingerprinting and TLS shaping
- Crash recovery
- Session pooling and management
- Scaling and orchestration model
- Anti-bot defenses
- Unknowns and how you'd tackle them

We want to see how you'd approach this independently and steer the project forward. You don’t need to know everything, but the plan should be grounded and reasonable.

Time cap: 1–2 days max. Let us know if that sounds fair or if you'd prefer to tweak anything. We’re flexible, just aiming for something valuable and time-bounded.
scrape.py (new file, 94 lines)
@@ -0,0 +1,94 @@
import requests
import websockets
import json
import asyncio
from pprint import pprint


async def scrape():
    # Simple monotonically increasing id for CDP messages
    id_count = [0]

    def get_id():
        id_count[0] += 1
        return id_count[0]

    # Ask the DevTools HTTP endpoint for the available page targets
    response = requests.get("http://localhost:3000/json")
    targets = response.json()

    if not targets:
        print("No active sessions found")
        return

    websocket_url = targets[0]['webSocketDebuggerUrl']
    print(f"Connecting to: {websocket_url}")

    async with websockets.connect(websocket_url) as ws:
        # Enable the DOM and Page domains before using them
        for elem in ["DOM", "Page"]:
            print("Enabling", elem)
            await ws.send(json.dumps({
                "id": get_id(),
                "method": f"{elem}.enable"
            }))
            # await asyncio.sleep(1)
            response = await ws.recv()
            print(f"{elem} enabled:", json.loads(response))

        print("Starting up")

        await ws.send(json.dumps({
            "id": get_id(),
            "method": "Page.navigate",
            # "params": {"url": "https://www.google.com/search?q=MINISFORUM+MS-A2"}
            "params": {"url": "https://ferano.io"}
        }))

        print("Sent navigate request")

        # Wait for the page load to finish
        while True:
            response = await ws.recv()
            data = json.loads(response)
            if data.get("method") == "Page.loadEventFired":
                break

        print("Got loadEventFired event")
        print("Get Document...")

        await ws.send(json.dumps({
            "id": get_id(),
            "method": "DOM.getDocument"
        }))

        print("Woot")

        document_id = id_count[0]  # Store the ID we just used
        while True:
            response = await ws.recv()
            data = json.loads(response)

            # Check if this is the response to our DOM.getDocument request
            if data.get("id") == document_id:
                root_node_id = data['result']['root']['nodeId']
                await ws.send(json.dumps({
                    "id": get_id(),
                    "method": "DOM.getOuterHTML",
                    "params": {"nodeId": root_node_id}
                }))

                html_id = id_count[0]
                while True:
                    response = await ws.recv()
                    data = json.loads(response)
                    if data.get("id") == html_id and "result" in data:
                        html_content = data['result']['outerHTML']
                        print(html_content)
                        break
                    else:
                        print("Received event:", data)
                        print("Something happened")
                        break
                # Done with this target; leave the outer loop as well
                break

        # response = await ws.recv()
        # root_data = json.loads(response)
        # root_node_id = root_data["result"]["root"]["nodeId"]
        # print(root_data)


asyncio.run(scrape())
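The task also asks for the final HTML to land in a file and for a transfer-size estimate. A possible extension (my sketch, not in the commit) is to enable the Network domain alongside DOM and Page, sum encodedDataLength from Network.loadingFinished events while waiting for the load, and write the captured outerHTML to disk:

    # Sketch: helpers that could be folded into scrape() above; names are mine.
    import json
    import time
    import pathlib

    transferred = 0          # bytes over the wire, per Network.loadingFinished
    started = time.monotonic()

    def track_transfer(raw_message):
        """Call on every websocket message once Network.enable has been sent."""
        global transferred
        data = json.loads(raw_message)
        if data.get("method") == "Network.loadingFinished":
            transferred += data["params"]["encodedDataLength"]

    def save_result(html_content):
        pathlib.Path("output.html").write_text(html_content, encoding="utf-8")
        print(f"time to response: {time.monotonic() - started:.2f}s, "
              f"~{transferred / 1024:.1f} KiB transferred")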