NewsPulse AI Scraper

Overview

NewsPulse AI Scraper is an automated web scraping system built with Python, Selenium, Beautiful Soup, and ScrapeGraphAI. It collects and summarizes articles from consulting firms' websites, extracting key information such as the title, author, publication date, and named entities.

Why I Created NewsPulse AI Scraper

created NewsPulse AI Scraper to automate the process of gathering and summarizing articles from consulting firms’ websites. Instead of manually browsing and extracting key insights, this tool leverages AI and web scraping to efficiently collect, analyze, and condense important information. By combining Selenium, Beautiful Soup, and OpenAI’s GPT model, the scraper provides structured data and concise summaries, making it easier to track industry trends and insights. This project enhances productivity, saves time, and helps professionals stay informed with minimal effort. 🚀

Features

🔍 Automated Web Scraping – Uses Selenium to dynamically load and scroll through web pages.
📝 Data Extraction – Parses article content, metadata, and named entities using Beautiful Soup and spaCy.
🤖 AI-Powered Summarization – Utilizes OpenAI's GPT model to generate concise article summaries.
📂 CSV Export – Saves extracted data in a structured CSV file for further analysis.

Technologies Used

Python
Selenium
Beautiful Soup
ScrapeGraphAI
OpenAI API (for content summarization)
spaCy (for named entity recognition)
Pandas (for data storage and export)

📂 Project Structure

This project consists of multiple scripts, each serving a unique purpose:

1️⃣ `mckinseyaiscraper.py`

Scrapes articles exclusively from the McKinsey website.
Uses Selenium and BeautifulSoup to extract article content, metadata, and named entities.
Summarizes the extracted content using OpenAI’s GPT model.
Outputs the results in a structured CSV file.

2️⃣ `googlenews.py`

Allows users to enter a prompt in the terminal specifying the type of articles they are looking for.
Searches Google News for the most relevant articles.
Extracts and returns the top search results in a CSV file format, including titles, URLs, and summaries.

Installation

Clone the repository

git clone git@github.com:yhuan331/Multi-Threaded-HTTP-Server.git
cd Multi-Threaded-HTTP-Server

Create and activate a virtual environment (optional but recommended)

python3 -m venv venv
source venv/bin/activate  # On macOS/Linux
venv\Scripts\activate  # On Windows

Usage

a. Ensure you have a valid OpenAI API key and replace "your_openai_api_key_here" in the script with your actual API key. b. ensure you have all dependency installed c. Run the scraper:

python scraper.py

The script will:

Visit the McKinsey blog page
Scroll to load all articles
Extract article content and metadata
Use AI to summarize each article
Save the results to scraped_articles_summary.csv

Potential Enhancements 🚀

Expand support to other consulting firms' websites
Implement multi-threading for faster scraping
Improve AI summarization by fine-tuning models for news articles
Add a web interface for users to view and filter scraped articles

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
McKinsayAIScraper.py		McKinsayAIScraper.py
README.md		README.md
googlenews.py		googlenews.py
test.csv		test.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NewsPulse AI Scraper

Overview

Why I Created NewsPulse AI Scraper

Features

Technologies Used

📂 Project Structure

1️⃣ `mckinseyaiscraper.py`

2️⃣ `googlenews.py`

Installation

Clone the repository

Create and activate a virtual environment (optional but recommended)

Usage

The script will:

Potential Enhancements 🚀

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

NewsPulse AI Scraper

Overview

Why I Created NewsPulse AI Scraper

Features

Technologies Used

📂 Project Structure

1️⃣ mckinseyaiscraper.py

2️⃣ googlenews.py

Installation

Clone the repository

Create and activate a virtual environment (optional but recommended)

Usage

The script will:

Potential Enhancements 🚀

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1️⃣ `mckinseyaiscraper.py`

2️⃣ `googlenews.py`

Packages