* Setting up

I read the following on https://hub.docker.com/r/browserless/chrome:

#+begin_quote
Getting Chrome running well in docker is also a challenge as there's quiet a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.
#+end_quote

That made me think twice about setting it up myself, so I just grabbed this image for now.

- I realized soon enough that ws://localhost:3000 is browserless' own API, so I
  tried to figure out how to get the websocket for the Chrome DevTools. It
  turns out I need to launch an instance first.

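A minimal sketch of that lookup, assuming Chrome was launched with =--remote-debugging-port=9222= (the port and both helper names are assumptions, not something from the exercise). Chrome's DevTools HTTP endpoint =/json/version= returns a JSON document whose =webSocketDebuggerUrl= field is the browser-level websocket:

#+begin_src python
import json
from urllib.request import urlopen


def extract_ws_url(version_json: str) -> str:
    """Pull the browser websocket URL out of a /json/version response body."""
    info = json.loads(version_json)
    return info["webSocketDebuggerUrl"]


def fetch_ws_url(host: str = "localhost", port: int = 9222) -> str:
    """Ask a running Chrome for its DevTools websocket URL (port is an assumption)."""
    with urlopen(f"http://{host}:{port}/json/version", timeout=5) as resp:
        return extract_ws_url(resp.read().decode("utf-8"))
#+end_src

Keeping the parsing separate from the HTTP call makes it possible to check the logic against a saved response without a browser running.
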
Browserless has an API, but after going through the documentation I quickly felt
that using it would probably defeat the purpose of the exercise, so I instead
used this:

https://hub.docker.com/r/zenika/alpine-chrome

Perhaps the exercise is looking for me to actually build an image from scratch,
but let's make progress on all the other tasks before tackling that.

I immediately hit bot detection when just sending a normal websocket request to
the Docker container, so I started researching what I would need to do to avoid detection.

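For reference, a plain websocket request to Chrome looks roughly like the sketch below. It assumes the third-party =websocket-client= package and a page-level websocket URL obtained from the container; =build_navigate_command= and =navigate= are hypothetical helper names. The message shape (=id=, =method=, =params=) comes from the Chrome DevTools Protocol.

#+begin_src python
import json


def build_navigate_command(url: str, message_id: int = 1) -> str:
    """Serialize a Page.navigate command for the Chrome DevTools Protocol."""
    return json.dumps({"id": message_id, "method": "Page.navigate", "params": {"url": url}})


def navigate(page_ws_url: str, url: str) -> dict:
    """Send the command over a page-level websocket and return the next message Chrome sends."""
    import websocket  # third-party: pip install websocket-client

    # Recent Chrome versions may reject connections that carry an Origin header
    # unless started with --remote-allow-origins, so the header is suppressed here.
    ws = websocket.create_connection(page_ws_url, timeout=10, suppress_origin=True)
    try:
        ws.send(build_navigate_command(url))
        return json.loads(ws.recv())
    finally:
        ws.close()
#+end_src
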
Ok, so I found this:

https://github.com/ultrafunkamsterdam/undetected-chromedriver/

This issue covers how to pass Brave as the browser:
https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/806

I could set this up in the Docker container; however, I'm not sure this is the
right thing.

I found this resource:
https://bot.incolumitas.com/#botChallenge

Ok, so it works! I was able to scrape Google with the =driver.py= script!

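The =driver.py= script itself isn't reproduced in these notes. Here is a minimal sketch of what such a script could look like, based only on how it is invoked elsewhere in these notes (browser path, then URL) and on =uc.Chrome= showing up in the traceback. The =headless= setting and all function names are assumptions:

#+begin_src python
def parse_args(argv: list[str]) -> tuple[str, str]:
    """Return (browser_path, url) from a command line like the ones in these notes."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <browser-path> <url>")
    return argv[1], argv[2]


def fetch_html(browser_path: str, url: str) -> str:
    """Load url in an undetected Chrome instance and return the page source."""
    import undetected_chromedriver as uc  # third-party, imported lazily

    driver = uc.Chrome(browser_executable_path=browser_path, headless=True)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


def main(argv: list[str]) -> None:
    """Entry point: print the HTML of the requested page."""
    browser_path, url = parse_args(argv)
    print(fetch_html(browser_path, url))
#+end_src

Calling =main(sys.argv)= under the usual ~if __name__ == "__main__":~ guard would turn this into a command-line script.
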
I could use this, but let's see if I can just build the Docker container myself.

https://hub.docker.com/r/ultrafunk/undetected-chromedriver

Setting this up with the underlying Dockerfile, but I'm hitting this issue:
#+begin_quote
/app $ python driver.py /usr/bin/chromium-browser https://ferano.io
Traceback (most recent call last):
  File "/app/driver.py", line 12, in <module>
    driver = uc.Chrome(
             ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
    super(Chrome, self).__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 47, in __init__
    super().__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 69, in __init__
    super().__init__(command_executor=executor, options=options)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 261, in __init__
    self.start_session(capabilities)
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 724, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 362, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 454, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 232, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: cannot connect to chrome at 127.0.0.1:48747
from session not created: This version of ChromeDriver only supports Chrome version 138
Current browser version is 124.0.6367.78; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#sessionnotcreatedexception
#+end_quote

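The error says the ChromeDriver in use targets Chrome 138 while the installed browser is 124. Besides changing the image, one option that might work (sketched here as an untested assumption) is to read the installed browser's major version and hand it to =uc.Chrome= through its =version_main= parameter so that a matching driver is requested. Both helper names are hypothetical:

#+begin_src python
import re
import subprocess


def browser_major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Chromium 124.0.6367.78'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def installed_major_version(browser_path: str) -> int:
    """Run '<browser> --version' and parse its output."""
    result = subprocess.run(
        [browser_path, "--version"], capture_output=True, text=True, check=True
    )
    return browser_major_version(result.stdout)


# Hypothetical use inside driver.py:
#   uc.Chrome(browser_executable_path=path, version_main=installed_major_version(path))
#+end_src
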
So now I need a new Docker image: https://hub.docker.com/r/selenium/standalone-chrome

Updated the Dockerfile. Now this works, I get back my website's HTML!

#+begin_src sh
docker exec -it search-api python driver.py /usr/bin/google-chrome https://ferano.io
#+end_src