Concurrent Web Scraping with Selenium Grid and Docker Swarm
https://github.com/coding-to-music/selenium-grid-docker-swarm
Want to learn how to build this project?
Check out the blog post.
https://testdriven.io/blog/concurrent-web-scraping-with-selenium-grid-and-docker-swarm/
Want to use this project?
Fork/Clone
https://github.com/coding-to-music/selenium-grid-docker-swarm
Create and activate a virtual environment
sudo apt-get install python3-virtualenv
virtualenv -p python3 myApp
optionally pass --no-site-packages to isolate the environment from system packages (note: this is the default behavior in recent virtualenv releases, and the flag was removed in virtualenv 20+)
virtualenv --no-site-packages -p python3 myApp
source myApp/bin/activate
$ cd myApp/
$ source bin/activate
(myApp)debian@hostname:~/myApp$
Install the requirements
pip install -r requirements.txt
Sign up for Digital Ocean and generate an access token
Add the token to your environment:
(env)$ export DIGITAL_OCEAN_ACCESS_TOKEN=[your_token]
Spin up four droplets and deploy Docker Swarm:
(env)$ sh project/create.sh
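create.sh is not reproduced here; as a rough sketch of what a script like it typically does (node names, driver flags, and the compose file name below are assumptions for illustration, not copied from the project):

```shell
#!/bin/sh
# Hypothetical provisioning sketch: create four DigitalOcean droplets
# with docker-machine, form a swarm, and deploy the Selenium Grid stack.

for i in 1 2 3 4; do
  docker-machine create \
    --driver digitalocean \
    --digitalocean-access-token "${DIGITAL_OCEAN_ACCESS_TOKEN}" \
    node-$i
done

# Initialize the swarm on node-1 and capture the worker join token.
eval "$(docker-machine env node-1)"
docker swarm init --advertise-addr "$(docker-machine ip node-1)"
TOKEN=$(docker swarm join-token -q worker)

# Join the remaining nodes as workers.
for i in 2 3 4; do
  eval "$(docker-machine env node-$i)"
  docker swarm join --token "$TOKEN" "$(docker-machine ip node-1):2377"
done

# Deploy the stack (the "selenium" stack name matches the
# selenium_hub service referenced later in this README).
eval "$(docker-machine env node-1)"
docker stack deploy --compose-file=docker-compose.yml selenium
```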
Run the scraper:
(env)$ docker-machine env node-1
(env)$ eval $(docker-machine env node-1)
(env)$ NODE=$(docker service ps --format "{{.Node}}" selenium_hub)
(env)$ for i in {1..8}; do {
python project/script.py ${NODE} &
};
done
The node name is passed to the script so it can reach the Selenium hub, and the trailing & launches the eight scraper runs concurrently.
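For the scraping to actually be concurrent, each loop iteration should launch the script in the background (a trailing &), with wait collecting the jobs afterwards. The pattern in isolation, with a harmless placeholder command:

```shell
# Fan-out pattern: launch 8 background jobs, then wait for all to finish.
for i in {1..8}; do {
  echo "job $i started" &
};
done
wait
echo "all jobs finished"
```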
Bring down the resources:
(env)$ sh project/destroy.sh
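destroy.sh likewise is not reproduced here; teardown typically amounts to deleting the droplets via docker-machine (a sketch assuming the node names used above):

```shell
#!/bin/sh
# Hypothetical teardown sketch: force-remove all four droplets.
# This permanently deletes the remote machines (and stops billing for them).
docker-machine rm -f node-1 node-2 node-3 node-4
```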