auto-scraper/Log.org

Setting up

I read the following on https://hub.docker.com/r/browserless/chrome:

Getting Chrome running well in docker is also a challenge as there's quiet a few packages you need in order to get Chrome running. Once that's done then there's still missing fonts, getting libraries to work with it, and having limitations on service reliability.

That made me think twice about setting it up myself, so I just grabbed this image for now.

  • I realized soon enough that ws://localhost:3000 is browserless' own API, so I tried to figure out how to get the WebSocket for the Chrome DevTools. It turns out I need to launch an instance first.
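For reference, here is a minimal sketch of how the DevTools WebSocket URL can be looked up once an instance is running. It assumes Chrome was started with remote debugging enabled on port 9222 (the port and the function names are my assumptions, not something taken from the browserless setup); Chrome's DevTools HTTP endpoint `/json/version` then reports the URL in its `webSocketDebuggerUrl` field.

```python
import json
from urllib.request import urlopen


def extract_ws_url(version_payload: str) -> str:
    """Pull the DevTools WebSocket URL out of a /json/version response body."""
    info = json.loads(version_payload)
    return info["webSocketDebuggerUrl"]


def devtools_ws_url(host: str = "localhost", port: int = 9222) -> str:
    """Ask a running Chrome instance for its browser-level WebSocket URL.

    Assumes Chrome was launched with --remote-debugging-port=<port>.
    """
    with urlopen(f"http://{host}:{port}/json/version", timeout=5) as resp:
        return extract_ws_url(resp.read().decode("utf-8"))
```

Calling `devtools_ws_url()` only works while a browser instance is up, which matches the observation above that an instance has to be launched first.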

Browserless has an API, but after going through the documentation I quickly felt that using it would probably defeat the purpose of the exercise, so I used this instead:

https://hub.docker.com/r/zenika/alpine-chrome

Perhaps the exercise is looking for me to actually build an image from scratch, but let's make progress on all the other tasks before tackling that.

I immediately hit bot detection when just running a normal websocket request to the docker container, so I started researching what I would need to do to avoid detection.

OK, so I found this:

https://github.com/ultrafunkamsterdam/undetected-chromedriver/

This is how to pass Brave to it: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/806

I could set this up in the Docker container; however, I'm not sure it's the right approach.

I found this resource: https://bot.incolumitas.com/#botChallenge

OK, so it works! I was able to scrape Google with the driver.py script!

I could use this, but let's see if I can just build the docker container myself.

https://hub.docker.com/r/ultrafunk/undetected-chromedriver

I set this up with the underlying Dockerfile, but I'm hitting this issue:

/app $ python driver.py /usr/bin/chromium-browser https://ferano.io
Traceback (most recent call last):
  File "/app/driver.py", line 12, in <module>
    driver = uc.Chrome(
             ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 466, in __init__
    super(Chrome, self).__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 47, in __init__
    super().__init__(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 69, in __init__
    super().__init__(command_executor=executor, options=options)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 261, in __init__
    self.start_session(capabilities)
  File "/usr/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 724, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 362, in start_session
    response = self.execute(Command.NEW_SESSION, caps)["value"]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 454, in execute
    self.error_handler.check_response(response)
  File "/usr/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 232, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: cannot connect to chrome at 127.0.0.1:48747
from session not created: This version of ChromeDriver only supports Chrome version 138
Current browser version is 124.0.6367.78; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#sessionnotcreatedexception
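The important part of that error is the version mismatch: the downloaded ChromeDriver targets Chrome 138, while the Chromium in the image is 124. As a small sketch (the helper name is hypothetical, and it relies only on the wording of the message above), the two major versions can be pulled out of the message so a script can report the mismatch clearly:

```python
import re


def parse_version_mismatch(message: str):
    """Return (driver_major, browser_major) from a ChromeDriver mismatch message.

    Returns None when the message does not contain both version statements.
    """
    supported = re.search(r"only supports Chrome version (\d+)", message)
    current = re.search(r"Current browser version is (\d+)", message)
    if not (supported and current):
        return None
    return int(supported.group(1)), int(current.group(1))
```

undetected-chromedriver's `Chrome` constructor also accepts a `version_main` argument meant for pinning the driver to the browser's major version; I did not try that here, so whether it would have fixed this container is an open question.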

So now I need a new Docker image: https://hub.docker.com/r/selenium/standalone-chrome

Updated the Dockerfile. Now this works, and I get back my website's HTML!

docker exec -it search-api python driver.py /usr/bin/google-chrome https://ferano.io
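For context, a script invoked this way could look roughly like the sketch below. This is not the actual driver.py; it is a guess at its shape based on the command line (browser path, then URL) and the `uc.Chrome(` call visible in the traceback. The function names and the Chrome flags are assumptions.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the two positional arguments used on the command line above."""
    parser = argparse.ArgumentParser(description="Fetch a page's HTML.")
    parser.add_argument("browser_path", help="path to the Chrome/Chromium binary")
    parser.add_argument("url", help="page to load")
    return parser.parse_args(argv)


def fetch_html(browser_path, url):
    """Load url in headless Chrome and return the page source."""
    # Imported lazily so argument parsing works even without the package installed.
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    # Assumption: these flags are commonly needed when Chrome runs in a container.
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = uc.Chrome(
        options=options,
        browser_executable_path=browser_path,
        headless=True,
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always shut the browser down, even if the page load fails.
        driver.quit()


def main(argv):
    args = parse_args(argv)
    print(fetch_html(args.browser_path, args.url))


# Only run when arguments were actually supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

With something like this in the image, the `docker exec` command above prints the page's HTML to stdout.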