Install undetected-chromedriver on selenium/standalone-chrome image, I can scrape!
parent 25e8cd49c4
commit 8d821c36af
15
Dockerfile
@@ -1,7 +1,14 @@
-FROM zenika/alpine-chrome:latest
+FROM selenium/standalone-chrome:latest
 
+USER root
+
+RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
+
+RUN pip3 install --break-system-packages undetected-chromedriver
+
+COPY driver.py /app/
+WORKDIR /app
+
-# Expose port 3000 for remote debugging
 EXPOSE 3000
 
-# Override the default command to use port 3000
-CMD ["chromium-browser", "--headless", "--no-sandbox", "--disable-gpu", "--remote-debugging-port=3000", "--remote-debugging-address=0.0.0.0"]
+CMD ["google-chrome", "--headless", "--no-sandbox", "--disable-gpu", "--remote-debugging-port=3000", "--remote-debugging-address=0.0.0.0"]
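The CMD above starts Chrome with remote debugging on port 3000. As a rough sketch of how a client might locate the DevTools websocket, the snippet below reads the `/json/version` endpoint that Chrome serves on the debugging port. The host `localhost` and the helper names are assumptions for illustration only, and whether the port is reachable from outside the container depends on the Chrome version and how the container is run.

```python
import json
from urllib.request import urlopen


def devtools_version_url(host: str = "localhost", port: int = 3000) -> str:
    """Build the URL of Chrome's DevTools version endpoint (host is an assumption)."""
    return f"http://{host}:{port}/json/version"


def websocket_url(version_json: str) -> str:
    """Extract the browser-level websocket URL from a /json/version response body."""
    info = json.loads(version_json)
    return info["webSocketDebuggerUrl"]


def fetch_websocket_url(host: str = "localhost", port: int = 3000,
                        timeout: float = 5.0) -> str:
    """Query the running container and return the websocket URL to connect to."""
    with urlopen(devtools_version_url(host, port), timeout=timeout) as resp:
        return websocket_url(resp.read().decode("utf-8"))

# Example (requires the container to be running and port 3000 published):
#   print(fetch_websocket_url())
```

Keeping the JSON parsing separate from the network call makes the parsing part easy to check without a running browser.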
42
Log.org
@@ -21,6 +21,9 @@ https://hub.docker.com/r/zenika/alpine-chrome
 Perhaps the exercise is looking for me to actually build an image from scratch,
 but let's make progress on all the other tasks before tackling that.
 
+I immediately hit bot detection when just running a normal websocket request to
+the docker container, so I started researching what I would need to do to avoid detection.
+
 
 
 
@@ -39,3 +42,42 @@ I found this resource;
 https://bot.incolumitas.com/#botChallenge
 
 Ok, so it works! I was able to scrape google with the =driver.py= script!
+
+I could use this, but let's see if I can just build the docker container myself.
+
+https://hub.docker.com/r/ultrafunk/undetected-chromedriver
+
+Setting this up with the underlying Dockerfile, but I'm hitting this issue;
+#+begin_quote
+/app $ python driver.py /usr/bin/chromium-browser https://ferano.io
+Traceback (most recent call last):
+  File "/app/driver.py", line 12, in <module>
+    driver = uc.Chrome(
+             ^^^^^^^^^^
+  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
+    super(Chrome, self).__init__(
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 47, in __init__ super().__init__(
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 69, in __init__
+    super().__init__(command_executor=executor, options=options)
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 261, in __init__
+    self.start_session(capabilities)
+  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 724, in start_session
+    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 362, in start_session
+    response = self.execute(Command.NEW_SESSION, caps)["value"]
+               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 454, in execute self.error_handler.check_response(response)
+  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 232, in check_response
+    raise exception_class(message, screen, stacktrace)
+selenium.common.exceptions.SessionNotCreatedException: Message: session not created: cannot connect to chrome at 127.0.0.1:48747
+from session not created: This version of ChromeDriver only supports Chrome version 138
+Current browser version is 124.0.6367.78; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#sessionnotcreatedexception
+#+end_quote
+
+So now I need a new docker image, https://hub.docker.com/r/selenium/standalone-chrome
+
+Updated the Dockerfile. Now this works, I get back my website's HTML!
+
+#+begin_src sh
+docker exec -it search-api python driver.py /usr/bin/google-chrome https://ferano.io
+#+end_src
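The traceback in the log comes from a ChromeDriver built for Chrome 138 talking to a 124 browser. The commit resolves this by switching base images. A different option, sketched below and not what this commit does, would be to read the browser's major version and pass it to `uc.Chrome` through its `version_main` parameter. The helper names are hypothetical, and this may not work on every base image, since the driver that gets downloaded still has to run there.

```python
import re
import subprocess


def parse_major_version(version_output: str) -> int:
    """Pull the major version out of text such as '... 124.0.6367.78 ...'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no browser version found in: {version_output!r}")
    return int(match.group(1))


def browser_major_version(browser_path: str) -> int:
    """Ask the browser binary for its version and return the major number."""
    completed = subprocess.run(
        [browser_path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_major_version(completed.stdout)

# Possible use in driver.py (assumes undetected_chromedriver is installed):
#   import undetected_chromedriver as uc
#   path = sys.argv[1]
#   driver = uc.Chrome(browser_executable_path=path,
#                      version_main=browser_major_version(path))
```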
22
driver.py
@@ -1,17 +1,25 @@
 import undetected_chromedriver as uc
+import sys
+
+if len(sys.argv) < 3:
+    sys.exit("usage: driver.py <path-to-browser> <site-to-scrape>")
+
+options = uc.ChromeOptions()
+options.add_argument('--no-sandbox')
+options.add_argument('--disable-dev-shm-usage')
+options.add_argument('--disable-gpu')
 
 driver = uc.Chrome(
-    browser_executable_path='/opt/brave.com/brave/brave',
-    # headless=True,
-    # use_subprocess=False
+    browser_executable_path=sys.argv[1],
+    headless=True,
+    use_subprocess=False,
+    options=options
 )
-driver.get('https://www.google.com/search?q=MINISFORUM+MS-A2')
-driver.save_screenshot('nowsecure.png')
-
-doc = await iframe_tab.send(cdp_generator("DOM.getDocument", {"depth": -1, "pierce": True}))
+driver.get(sys.argv[2])
 
 data = driver.execute_cdp_cmd('DOM.getDocument', {})
 if data:
     if 'root' in data:
         root_node_id = data['root']['nodeId']
         html = driver.execute_cdp_cmd('DOM.getOuterHTML', {"nodeId": root_node_id})
+        print(html)
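The script above prints the whole `DOM.getOuterHTML` result, which is a dictionary whose `outerHTML` key holds the markup. A minimal sketch of pulling out just the HTML string follows. The function name `page_html` is hypothetical, and the CDP call is passed in as a callable so the logic can be exercised without a browser.

```python
from typing import Callable, Optional

# Any callable shaped like Selenium's driver.execute_cdp_cmd(cmd, cmd_args).
CdpCall = Callable[[str, dict], dict]


def page_html(cdp: CdpCall) -> Optional[str]:
    """Return the document's outer HTML as a string, or None if unavailable."""
    data = cdp("DOM.getDocument", {})
    root = (data or {}).get("root")
    if not root:
        # No document root came back, so there is nothing to serialize.
        return None
    result = cdp("DOM.getOuterHTML", {"nodeId": root["nodeId"]})
    return result.get("outerHTML")

# Possible use in driver.py:
#   html = page_html(driver.execute_cdp_cmd)
#   if html is not None:
#       print(html)
```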
7
image-rebuild.sh
Executable file
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+docker stop search-api
+docker rm search-api
+docker build -t search-api .
+docker run -d -p 3000:3000 --name search-api search-api
+# docker exec -it search-api python driver.py /usr/bin/chromium-browser https://ferano.io
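On a first run, `docker stop` and `docker rm` in the script above report an error because the container does not exist yet, while a failed build or run is a real problem. The sketch below expresses that same stop, remove, build, run sequence in Python, tolerating failures only for the cleanup steps. The function names and the injectable `run` parameter are assumptions for illustration, not part of the commit.

```python
import subprocess
from typing import Callable, List, Tuple


def rebuild_commands(name: str = "search-api", port: int = 3000) -> List[Tuple[List[str], bool]]:
    """Return (argv, must_succeed) pairs mirroring image-rebuild.sh."""
    return [
        (["docker", "stop", name], False),   # fine if the container is absent
        (["docker", "rm", name], False),     # fine if the container is absent
        (["docker", "build", "-t", name, "."], True),
        (["docker", "run", "-d", "-p", f"{port}:{port}", "--name", name, name], True),
    ]


def rebuild(name: str = "search-api", port: int = 3000,
            run: Callable = subprocess.run) -> None:
    """Run each step in order and stop at the first required step that fails."""
    for argv, must_succeed in rebuild_commands(name, port):
        result = run(argv)
        if must_succeed and result.returncode != 0:
            raise RuntimeError(f"command failed: {' '.join(argv)}")

# Example (requires Docker and a Dockerfile in the current directory):
#   rebuild()
```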