Scrape the title and headline using Puppeteer + Flask JSON Server (Multi-build Docker Project)

This repo scrapes the title and headline of a provided URL and then packages the whole project as a Docker image.

The project has two main parts:

A Node.js Script :- Uses Puppeteer to scrape a web page.
A Flask Server :- Serves the scraped data as JSON at the homepage ('/').

Project File Structure

--| scrape.js: Node.js script file to scrape the website headline and title.
--| server.py: Python Flask app to show the scraped data at the homepage ('/').
--| package.json: For puppeteer dependencies.
--| requirements.txt: For python flask dependencies.
--| Dockerfile: To build and run everything inside Docker.

What you need to have on your system

PYTHON
NODE.JS
DOCKER

Commands and when to use

To install puppeteer

- run  `npm install puppeteer`

To set the environment variable

for cmd

- run `set SCRAPE_URL=https://wikipedia.org`

for powershell

- run `$env:SCRAPE_URL="https://wikipedia.org"`

for bash

- run `export SCRAPE_URL="https://wikipedia.org"`
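
Whichever shell sets it, the variable reaches the program through the process environment. The sketch below shows one way a Python script could read and sanity-check `SCRAPE_URL` before using it. The function name `get_scrape_url` and its `env` parameter are illustrative assumptions and are not taken from this repo's code. In Node.js the same value would be available as `process.env.SCRAPE_URL`.

```python
import os
from urllib.parse import urlparse


def get_scrape_url(env=None):
    """Return the URL stored in SCRAPE_URL, or raise ValueError if it is unusable."""
    # `env` is a hypothetical parameter that makes the function easy to test;
    # when it is omitted, the real process environment is used.
    env = os.environ if env is None else env
    url = env.get("SCRAPE_URL", "").strip()
    if not url:
        raise ValueError("SCRAPE_URL is not set")
    parsed = urlparse(url)
    # Accept only absolute http(s) URLs such as https://wikipedia.org
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"SCRAPE_URL is not an http(s) URL: {url!r}")
    return url
```

Calling `get_scrape_url()` with no arguments reads the variable set by one of the commands above and fails early with a clear message if it is missing.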

To run scrape.js

- run `node scrape.js`

To run the Flask server

- run `python server.py`

Once the steps above are complete, a new file named scraped_data.json appears in the current working folder. It holds the headline and title of the provided URL in JSON format. The Flask app server renders this scraped data.

To see the result, run `python server.py`. The server prints a local link in the terminal (the Flask development server normally uses http, not https). Open the link to see the scraped data in JSON format in your browser.
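
For illustration, the JSON file described above could be loaded and checked like this before it is served. This is a minimal sketch, not the actual `server.py`: the function name `load_scraped_data` and the key names `title` and `headline` are assumptions, since the exact layout of the file is not shown here.

```python
import json
from pathlib import Path


def load_scraped_data(path="scraped_data.json"):
    """Read the scraper output file and return its contents as a dict."""
    file = Path(path)
    if not file.exists():
        raise FileNotFoundError(f"{path} not found; run the scraper first")
    data = json.loads(file.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path} should contain a JSON object")
    # Assumed key names: the README only says the file holds a title and a headline.
    missing = [key for key in ("title", "headline") if key not in data]
    if missing:
        raise KeyError(f"missing keys in {path}: {missing}")
    return data
```

A Flask route for `'/'` could return the result of this function so that the browser shows the same JSON that the scraper wrote.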

To create the Docker image

You need the Dockerfile, which is used to build the Docker image on your system.

- run `docker build -t scraper-server .` to build the image.
- run `docker run -p 5000:5000 -e SCRAPE_URL=https://wikipedia.org scraper-server` to run the container.
`-p 5000:5000` maps port 5000 of the container to port 5000 on your local machine. `-p` publishes a port (host:container).
`-e SCRAPE_URL=...` tells it which website to scrape. `-e` sets an environment variable inside the container.
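
To make the two flags concrete, here is a small sketch that assembles the same `docker run` command from its parts. The helper `docker_run_args` is hypothetical and only builds the argument list; it does not start a container.

```python
import shlex


def docker_run_args(image, scrape_url, host_port=5000, container_port=5000):
    """Build the argument list for `docker run` with one port and one env var."""
    return [
        "docker", "run",
        "-p", f"{host_port}:{container_port}",  # host port first, container port second
        "-e", f"SCRAPE_URL={scrape_url}",       # environment variable for the container
        image,
    ]


args = docker_run_args("scraper-server", "https://wikipedia.org")
print(shlex.join(args))
# prints: docker run -p 5000:5000 -e SCRAPE_URL=https://wikipedia.org scraper-server
```

The list form can be passed to `subprocess.run(args)` if you want to launch the container from a script. Changing `host_port` to, say, 8080 would make the app reachable on port 8080 of your machine while the container still listens on 5000.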

Check whether your container is running or not

- run `docker ps` to see all the running containers on your system.

Once you have verified that the container is running, open your browser and go to http://localhost:5000, since port 5000 is exposed for the app. You will see the scraped headline and title of the provided URL in JSON format.
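
The same check can be scripted instead of done in the browser. The sketch below polls the server until it answers and returns the parsed JSON. The function names are illustrative, and the default URL assumes the container is published on port 5000 over plain http, which is what the Flask development server uses by default.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=3.0):
    """Return the decoded JSON body from `url`, or None if it cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers connection errors and HTTP errors; ValueError covers bad JSON.
        return None


def wait_for_server(url="http://localhost:5000/", attempts=10, delay=1.0):
    """Poll `url` up to `attempts` times and return the first JSON answer, else None."""
    for _ in range(attempts):
        data = fetch_json(url)
        if data is not None:
            return data
        time.sleep(delay)
    return None
```

Calling `wait_for_server()` right after `docker run` gives the container a few seconds to start before the result is read.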

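To stop the container and remove the image

- run `docker ps` to find the container name and its ID.
- then run `docker stop <id>` to stop the container.

To remove the image, run `docker rmi <image-name>`, e.g. `docker rmi scraper-server`. If Docker reports that the image is still used by a stopped container, remove that container first with `docker rm <id>`.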
