Scrape my website with zenika/alpine-chrome
commit 875b8c35ac
Dockerfile (new file, 7 lines)
@@ -0,0 +1,7 @@
FROM zenika/alpine-chrome:latest

# Expose port 3000 for remote debugging
EXPOSE 3000

# Override the default command to use port 3000
CMD ["chromium-browser", "--headless", "--no-sandbox", "--disable-gpu", "--remote-debugging-port=3000", "--remote-debugging-address=0.0.0.0"]
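A quick smoke test for the image above (my sketch, not part of the commit): assuming it was built and started with something like docker build -t alpine-chrome-scraper . followed by docker run --rm -p 3000:3000 alpine-chrome-scraper (the image tag and port mapping are my assumptions), the Chrome DevTools HTTP API should answer on /json/version:

    # Hypothetical smoke test; the image tag and port mapping above are assumptions.
    import requests

    info = requests.get("http://localhost:3000/json/version", timeout=5).json()
    print(info.get("Browser"))               # e.g. "HeadlessChrome/..."
    print(info.get("webSocketDebuggerUrl"))  # browser-level CDP websocket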
Log.org (new file, 25 lines)
@@ -0,0 +1,25 @@
* Setting up

After reading the following on https://hub.docker.com/r/browserless/chrome

#+begin_quote
Getting Chrome running well in docker is also a challenge as there's quite a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.
#+end_quote

This made me think twice about setting it up myself, so I just grabbed this image for now.

- I realized soon enough that ws://localhost:3000 is browserless' own API, so I went
  and tried to figure out how to go about getting the websocket for the chrome
  devtools; it turns out I need to launch an instance first.
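For reference, here is a sketch (mine, not from the log) of what "launch an instance first" can look like against a plain Chromium DevTools endpoint rather than the browserless API: /json/new opens a fresh target and returns its webSocketDebuggerUrl. Recent Chromium expects PUT for this route; older builds accepted GET.

#+begin_src python
# Sketch only: assumes a Chromium DevTools HTTP endpoint listening on port 3000.
import requests

def open_target(base="http://localhost:3000", url="about:blank"):
    # Recent Chromium expects PUT for /json/new; older builds used GET.
    target = requests.put(f"{base}/json/new?{url}").json()
    return target["webSocketDebuggerUrl"]
#+end_src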

Browserless has an API, but I went through the documentation and quickly felt
like using it probably defeats the purpose of the exercise, so I instead
used this:

https://hub.docker.com/r/zenika/alpine-chrome

Perhaps the exercise is looking for me to actually build an image from scratch,
but let's make progress on all the other tasks before tackling that.
Task.org (new file, 43 lines)
@@ -0,0 +1,43 @@
* Task

Build a Docker image that boots a minimal browser (Chromium, Firefox, Safari, or Edge all work). Then write a small script that uses the image to scrape the following URL:

https://www.google.com/search?q=MINISFORUM+MS-A2

Requirements:

- Accept optional proxy URL and optional browser launch flags (see the sketch after this list)

* Estimate and report:

- Cold start time
- Total transfer size (bandwidth over the wire)
- Time to response
- CPU and memory usage

- Save final HTML output to a file
- Use any language you're comfortable with
- We can provide a proxy URL, or you can use your own
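
A rough sketch (editor's illustration; option names like --proxy and --flag are not specified by the task) of how the scraping script could accept the optional proxy URL and extra launch flags and turn them into Chromium arguments:

#+begin_src python
# Illustrative only: the argument names below are assumptions, not part of the task.
import argparse

def parse_args():
    p = argparse.ArgumentParser(description="Scrape a URL through headless Chromium")
    p.add_argument("--proxy", default=None,
                   help="optional proxy URL, forwarded as --proxy-server=<url>")
    p.add_argument("--flag", action="append", default=[],
                   help="extra Chromium launch flag; repeatable")
    return p.parse_args()

def chromium_args(opts):
    extra = list(opts.flag)
    if opts.proxy:
        extra.append(f"--proxy-server={opts.proxy}")
    return extra  # appended to the chromium-browser command line in the container

if __name__ == "__main__":
    print(chromium_args(parse_args()))
#+end_src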

* Goal:

Optimize for:

- Low latency
- Minimal bandwidth
- High success rate (avoid bans, captchas, etc.)

Then:

Write a short design doc (max 4 pages) outlining how you'd scale this to 10k concurrent requests. No need to detail measurement tooling; just focus on next steps to evolve this into a full browser farm. Include:

- Fingerprinting and TLS shaping
- Crash recovery
- Session pooling and management
- Scaling and orchestration model
- Anti-bot defenses
- Unknowns and how you'd tackle them

We want to see how you'd approach this independently and steer the project forward. You don’t need to know everything, but the plan should be grounded and reasonable.

Time cap: 1–2 days max. Let us know if that sounds fair or if you'd prefer to tweak anything. We’re flexible, just aiming for something valuable and time-bounded.
scrape.py (new file, 94 lines)
@@ -0,0 +1,94 @@
import requests
import websockets
import json
import asyncio
from pprint import pprint


async def scrape():
    # Simple monotonically increasing id for CDP messages
    id_count = [0]

    def get_id():
        id_count[0] += 1
        return id_count[0]

    # Ask the DevTools HTTP endpoint for the available page targets
    response = requests.get("http://localhost:3000/json")
    targets = response.json()

    if not targets:
        print("No active sessions found")
        return

    websocket_url = targets[0]['webSocketDebuggerUrl']
    print(f"Connecting to: {websocket_url}")

    async with websockets.connect(websocket_url) as ws:
        # Enable the DOM and Page domains before using them
        for elem in ["DOM", "Page"]:
            print("Enabling", elem)
            await ws.send(json.dumps({
                "id": get_id(),
                "method": f"{elem}.enable"
            }))
            # await asyncio.sleep(1)
            response = await ws.recv()
            print(f"{elem} enabled:", json.loads(response))

        print("Starting up")

        await ws.send(json.dumps({
            "id": get_id(),
            "method": "Page.navigate",
            # "params": {"url": "https://www.google.com/search?q=MINISFORUM+MS-A2"}
            "params": {"url": "https://ferano.io"}
        }))

        print("Sent navigate request")

        # Wait for the page load to finish
        while True:
            response = await ws.recv()
            data = json.loads(response)
            if data.get("method") == "Page.loadEventFired":
                break

        print("Got loadEventFired event")
        print("Get Document...")

        await ws.send(json.dumps({
            "id": get_id(),
            "method": "DOM.getDocument"
        }))

        print("Woot")

        document_id = id_count[0]  # Store the ID we just used
        while True:
            response = await ws.recv()
            data = json.loads(response)

            # Check if this is the response to our DOM.getDocument request
            if data.get("id") == document_id:
                root_node_id = data['result']['root']['nodeId']
                await ws.send(json.dumps({
                    "id": get_id(),
                    "method": "DOM.getOuterHTML",
                    "params": {"nodeId": root_node_id}
                }))

                html_id = id_count[0]
                while True:
                    response = await ws.recv()
                    data = json.loads(response)
                    if data.get("id") == html_id and "result" in data:
                        html_content = data['result']['outerHTML']
                        print(html_content)
                        break
                    else:
                        print("Received event:", data)
                        print("Something happened")
                        break
                # Done with this target; leave the outer loop as well
                break

        # response = await ws.recv()
        # root_data = json.loads(response)
        # root_node_id = root_data["result"]["root"]["nodeId"]
        # print(root_data)


asyncio.run(scrape())
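The task also asks for the final HTML to land in a file and for a transfer-size estimate. A possible extension (my sketch, not in the commit) is to enable the Network domain alongside DOM and Page, sum encodedDataLength from Network.loadingFinished events while waiting for the load, and write the captured outerHTML to disk:

    # Sketch: helpers that could be folded into scrape() above; names are mine.
    import json
    import time
    import pathlib

    transferred = 0          # bytes over the wire, per Network.loadingFinished
    started = time.monotonic()

    def track_transfer(raw_message):
        """Call on every websocket message once Network.enable has been sent."""
        global transferred
        data = json.loads(raw_message)
        if data.get("method") == "Network.loadingFinished":
            transferred += data["params"]["encodedDataLength"]

    def save_result(html_content):
        pathlib.Path("output.html").write_text(html_content, encoding="utf-8")
        print(f"time to response: {time.monotonic() - started:.2f}s, "
              f"~{transferred / 1024:.1f} KiB transferred")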