* Setting up
I read the following on https://hub.docker.com/r/browserless/chrome:
#+begin_quote
Getting Chrome running well in docker is also a challenge as there's quiet a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.
#+end_quote
That made me think twice about setting it up myself, so I just grabbed this image for now.
- I realized soon enough that ws://localhost:3000 is browserless' own API, so I
  tried to figure out how to get the websocket for the Chrome DevTools. Turns
  out I need to launch an instance first.
Browserless has an API, but after going through the documentation I quickly felt
that using it probably defeats the purpose of the exercise, so I used this
instead:
https://hub.docker.com/r/zenika/alpine-chrome
Perhaps the exercise is looking for me to actually build an image from scratch,
but let's make progress on all the other tasks before tackling that.
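For reference, a minimal sketch of how the DevTools websocket mentioned above can be looked up once a Chrome instance is running with remote debugging enabled. It assumes the container publishes the debugging port on localhost:9222 (the port is not recorded in this log), and =extract_ws_url= and =devtools_ws_url= are hypothetical helper names. Chrome serves a small JSON document at =/json/version= whose =webSocketDebuggerUrl= field is the address to connect to.

#+begin_src python
import json
from urllib.request import urlopen


def extract_ws_url(version_json: str) -> str:
    """Pull the DevTools websocket address out of a /json/version response body."""
    info = json.loads(version_json)
    return info["webSocketDebuggerUrl"]


def devtools_ws_url(host: str = "localhost", port: int = 9222) -> str:
    """Ask a running Chrome instance for its browser-level DevTools websocket URL."""
    # host and port are assumptions; adjust to however the container is published.
    with urlopen(f"http://{host}:{port}/json/version", timeout=5) as resp:
        return extract_ws_url(resp.read().decode("utf-8"))
#+end_src

Calling =devtools_ws_url()= only works while the container is up, which matches the "launch an instance first" observation.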
I immediately hit bot detection when just running a normal websocket request to
the docker container, so I started researching what I would need to do to avoid detection.
Ok, so I found this:
https://github.com/ultrafunkamsterdam/undetected-chromedriver/
This issue shows how to pass Brave as the browser:
https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/806
I could set this up in the docker container, but I'm not sure this is the
right approach.
I found this resource:
https://bot.incolumitas.com/#botChallenge
Ok, so it works! I was able to scrape Google with the =driver.py= script!
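The log does not include =driver.py= itself, so the following is only a guess at its shape, based on how it is invoked in this log (browser path first, URL second). The =uc.Chrome= call and its =browser_executable_path= and =headless= parameters come from undetected-chromedriver. The Chrome flags and the =parse_args= and =fetch_html= helpers are assumptions for illustration.

#+begin_src python
def parse_args(argv: list[str]) -> tuple[str, str]:
    """Return (browser_path, url) from a command line like the one used in this log."""
    if len(argv) != 3:
        raise SystemExit(f"usage: {argv[0]} <browser-path> <url>")
    return argv[1], argv[2]


def fetch_html(browser_path: str, url: str) -> str:
    """Load url in a patched Chrome and return the rendered page source."""
    # Imported lazily so the argument handling works without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    # Assumed flags: commonly needed when Chrome runs as root inside a container.
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = uc.Chrome(
        options=options,
        browser_executable_path=browser_path,
        headless=True,
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if the page load fails.
        driver.quit()
#+end_src

Printing =fetch_html(*parse_args(sys.argv))= from a main guard would reproduce the command-line usage.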
I could use this, but let's see if I can just build the docker container myself.
https://hub.docker.com/r/ultrafunk/undetected-chromedriver
I set this up using its underlying Dockerfile, but I'm hitting this issue:
#+begin_example
/app $ python driver.py /usr/bin/chromium-browser https://ferano.io
Traceback (most recent call last):
  File "/app/driver.py", line 12, in <module>
    driver = uc.Chrome(
             ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
    super(Chrome, self).__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 47, in __init__
    super().__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 69, in __init__
    super().__init__(command_executor=executor, options=options)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 261, in __init__
    self.start_session(capabilities)
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 724, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 362, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 454, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 232, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: cannot connect to chrome at 127.0.0.1:48747
from session not created: This version of ChromeDriver only supports Chrome version 138
Current browser version is 124.0.6367.78; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#sessionnotcreatedexception
#+end_example
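One alternative to switching images (not the route taken in this log) would be to read the browser's major version and hand it to undetected-chromedriver through its =version_main= parameter, so that a driver matching the installed browser is requested. A sketch under that assumption follows. =chrome_major_version= and =installed_major_version= are hypothetical helpers, and the exact =--version= output format is assumed rather than taken from the log.

#+begin_src python
import re
import subprocess


def chrome_major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Chromium 124.0.6367.78'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in: {version_output!r}")
    return int(match.group(1))


def installed_major_version(browser_path: str) -> int:
    """Run '<browser> --version' and return the major version it reports."""
    result = subprocess.run(
        [browser_path, "--version"], capture_output=True, text=True, check=True
    )
    return chrome_major_version(result.stdout)
#+end_src

The result could then be passed along as =uc.Chrome(version_main=installed_major_version(path), ...)=.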
So the image's Chromium (124) is too old for this ChromeDriver, which only
supports Chrome 138, and I need a new docker image: https://hub.docker.com/r/selenium/standalone-chrome
Updated the Dockerfile. Now this works, I get back my website's HTML!
#+begin_src sh
docker exec -it search-api python driver.py /usr/bin/google-chrome https://ferano.io
#+end_src