This repo scrapes the headline and title of a provided URL, then packages the whole setup as a Docker image.
- A Node.js script: uses Puppeteer to scrape a webpage.
- A Flask server: serves the scraped data at the homepage.
- scrape.js: Node.js script that scrapes the website's headline and title.
- server.py: Python Flask app that shows the scraped data at the homepage ('/').
- package.json: Puppeteer dependencies.
- requirements.txt: Python Flask dependencies.
- Dockerfile: builds and runs everything inside Docker.
- PYTHON
- NODE.JS
- DOCKER
- Run `npm install puppeteer`.
- Set the URL to scrape, using the command for your shell:
  - cmd: `set SCRAPE_URL=https://wikipedia.org`
  - PowerShell: `$env:SCRAPE_URL="https://wikipedia.org"`
  - bash: `export SCRAPE_URL="https://wikipedia.org"`
- Run `node scrape.js`.
- Run `python server.py`.
Once these steps are complete, a new file named scraped_data.json appears in the current working folder. It holds the headline and title of the provided URL in JSON format, and the Flask app serves this data.
To see the result, run `python server.py`. An http link to your local machine is printed in the terminal; open it in your browser to see the scraped data in JSON format.
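As a rough illustration of the serving step described above, here is a minimal sketch of what a server like server.py might look like. It is not the repo's actual code: the names `load_scraped_data` and `create_app` are hypothetical, and the only detail taken from the text is the scraped_data.json file name and the '/' route.

```python
import json
from pathlib import Path

# File name taken from the text above; the location is assumed to be the working folder.
DATA_FILE = Path("scraped_data.json")


def load_scraped_data(path=DATA_FILE):
    """Return the scraped data as a dict, or an error dict if the file is missing or invalid."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {"error": f"{path} not found; run the scraper first"}
    except json.JSONDecodeError:
        return {"error": f"{path} does not contain valid JSON"}


def create_app(path=DATA_FILE):
    """Build a Flask app that serves the scraped data at '/' (hypothetical helper)."""
    # Imported here so load_scraped_data can be used without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/")
    def home():
        # Re-read the file on every request so a fresh scrape shows up immediately.
        return jsonify(load_scraped_data(path))

    return app
```

Under these assumptions, calling `create_app().run(host="0.0.0.0", port=5000)` would start the server on port 5000. Binding to 0.0.0.0 rather than the default 127.0.0.1 matters once the app runs inside a container, because the port mapping cannot reach a server bound only to the container's loopback interface.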
You need the Dockerfile, which is used to build the Docker image on your system.
- Run `docker build -t scraper-server .` to build the image.
- Run `docker run -p 5000:5000 -e SCRAPE_URL=https://wikipedia.org scraper-server` to run the container.
  - `-p 5000:5000` maps port 5000 of the container to port 5000 on your local machine. `-p` publishes a port.
  - `-e SCRAPE_URL=...` tells it which website to scrape. `-e` sets an environment variable.
Check whether your container is running:
- Run `docker ps` to see all the running containers on your system.
Once you have verified that the container is running, open your browser and go to http://localhost:5000, since port 5000 is exposed for the app. You will see the scraped headline and title of the provided URL in JSON format.
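The same check can be done from a script instead of the browser. The sketch below is only an illustration and uses nothing but the Python standard library; the function name `fetch_scraped_data` is hypothetical, and the default URL assumes the container was started with the `-p 5000:5000` mapping shown earlier.

```python
import json
from urllib.request import urlopen


def fetch_scraped_data(url="http://localhost:5000/", timeout=5):
    """Request the given URL and return the decoded JSON body."""
    # urlopen raises an error for 4xx/5xx responses and for refused connections,
    # so reaching the body already means the server answered.
    with urlopen(url, timeout=timeout) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected HTTP status {response.status}")
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_scraped_data()` while the container is running should return the scraped data as a Python dict. A connection error usually means the container is not running or the port is not mapped.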
- Run `docker ps` to find the container name and its ID.
- Then run `docker stop <container-id>` to stop the container.
- To remove the image, run `docker rmi <image-name>`, e.g. `docker rmi scraper-server`.