
Event-Driven News Data Ingestion and Structuring in Snowflake using Snowpipe, S3, and Airflow

This project builds an automated, event-driven data pipeline that fetches news articles from a public API, stores the data in AWS S3 as Parquet files, and ingests it into Snowflake using Snowpipe. The entire pipeline is orchestrated and scheduled using Apache Airflow. The final data is structured into meaningful summary tables inside Snowflake.


πŸ› οΈ Tech Stack

  • Python – for API integration and data cleaning
  • AWS S3 – for raw Parquet file storage
  • Snowflake – for data staging, storage, and transformation
  • Snowpipe – for continuous, automated data ingestion
  • Apache Airflow – for pipeline scheduling and orchestration
  • NewsAPI – data source (news articles)
  • Docker – for running Airflow locally in a reproducible test environment

πŸ“Š Architecture

Architecture Diagram


πŸ“‚ Folder Structure

event-driven-snowflake-data-pipeline/
β”‚
β”œβ”€β”€ dags/
β”‚   └── news_pipeline_airflow_dag.py
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ fetch_news_etl_job.py
β”‚   β”œβ”€β”€ snowflake_commands.sql
β”‚   └── requirements.txt
β”‚
β”œβ”€β”€ docs/
β”‚   β”œβ”€β”€ architecture_diagram.png
β”‚   β”œβ”€β”€ how_to_run.md
β”‚   β”œβ”€β”€ project_demo_video_link
β”‚   β”œβ”€β”€ airflow_dag_image.png
β”‚   β”œβ”€β”€ summary_news_table_output.png
β”‚   └── author_activity_table_output.png
β”‚
└── README.md

βš™οΈ Pipeline Overview

  1. Airflow triggers the DAG daily
  2. fetch_news_etl_job.py (a minimal sketch follows this list):
    • Pulls news from the NewsAPI
    • Cleans and formats the data
    • Saves as a .parquet file
    • Uploads to AWS S3
  3. Snowpipe listens to S3 and ingests the Parquet file into Snowflake
  4. Airflow triggers SnowflakeOperator SQL tasks:
    • Creates staging and final tables
    • Populates summary tables (summary_news, author_activity)
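
As a rough illustration of step 2, a minimal version of the fetch job might look like the sketch below. The NewsAPI endpoint parameters, cleaning rules, environment variable names, and S3 key layout are assumptions for illustration, not taken from the repo's script:

```python
# A hypothetical sketch of fetch_news_etl_job.py; endpoint parameters,
# column handling, and naming are assumptions, not the repo's actual code.
import os
from datetime import datetime, timezone

import boto3
import pandas as pd
import requests

NEWS_API_KEY = os.environ["NEWS_API_KEY"]   # masked in the repo
S3_BUCKET = os.environ["NEWS_S3_BUCKET"]    # masked in the repo


def run_etl() -> str:
    # 1. Pull news articles from NewsAPI.
    resp = requests.get(
        "https://newsapi.org/v2/top-headlines",
        params={"language": "en", "pageSize": 100, "apiKey": NEWS_API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    articles = resp.json().get("articles", [])

    # 2. Clean and flatten into a tabular shape.
    df = pd.json_normalize(articles)
    df = df.rename(columns={"source.name": "source"})
    df = df.drop_duplicates(subset=["url"]).dropna(subset=["title"])

    # 3. Save as a Parquet file.
    run_ts = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    local_path = f"/tmp/news_{run_ts}.parquet"
    df.to_parquet(local_path, index=False)

    # 4. Upload to S3, where the Snowpipe event notification picks it up.
    key = f"news/news_{run_ts}.parquet"
    boto3.client("s3").upload_file(local_path, S3_BUCKET, key)
    return key
```

In the repo, a callable like this would be wired into the daily DAG in dags/news_pipeline_airflow_dag.py, with the SnowflakeOperator SQL tasks running downstream.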

πŸ“ SQL Logic (in Snowflake)

  • Create an external stage over the S3 bucket (the full statement sequence is sketched after this list)
  • Auto-create table using INFER_SCHEMA
  • Load data using COPY INTO
  • Create summary tables:
    • summary_news: article count per news source
    • author_activity: article count per author
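
The project runs this logic through Airflow's SnowflakeOperator (see scripts/snowflake_commands.sql). As a self-contained illustration, the same sequence might look roughly like the sketch below, executed here with the Snowflake Python connector; the stage, storage integration, file format, and column names are placeholders and assumptions, not the repo's actual identifiers:

```python
# Illustrative sketch only: object and column names are assumptions;
# the repo's actual statements live in scripts/snowflake_commands.sql.
import os

import snowflake.connector

STATEMENTS = [
    # Named Parquet file format used by the stage and by INFER_SCHEMA.
    "CREATE FILE FORMAT IF NOT EXISTS parquet_format TYPE = PARQUET",
    # External stage over the S3 prefix that Snowpipe also watches.
    """
    CREATE STAGE IF NOT EXISTS news_stage
      URL = 's3://<your-bucket>/news/'
      STORAGE_INTEGRATION = <your_storage_integration>
      FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
    """,
    # Auto-create the raw table from the Parquet schema via INFER_SCHEMA.
    """
    CREATE TABLE IF NOT EXISTS news_data
      USING TEMPLATE (
        SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
        FROM TABLE(INFER_SCHEMA(LOCATION => '@news_stage',
                                FILE_FORMAT => 'parquet_format'))
      )
    """,
    # Bulk-load staged files into the raw table.
    """
    COPY INTO news_data
      FROM @news_stage
      FILE_FORMAT = (FORMAT_NAME = 'parquet_format')
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """,
    # Summary tables: article counts per source and per author.
    """
    CREATE OR REPLACE TABLE summary_news AS
      SELECT source, COUNT(*) AS article_count
      FROM news_data
      GROUP BY source
    """,
    """
    CREATE OR REPLACE TABLE author_activity AS
      SELECT author, COUNT(*) AS article_count
      FROM news_data
      GROUP BY author
    """,
]

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
)
with conn.cursor() as cur:
    for stmt in STATEMENTS:
        cur.execute(stmt)
conn.close()
```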

πŸ“Š Pipeline Execution Summary

  • βœ… Successfully ingests ~997 records per run into the raw news_data table from S3 through an external stage
  • βœ… Generated 2 transformed tables:
    • summary_news containing 80 records
    • author_activity containing 113 records
  • πŸ” DAG is scheduled to run daily
  • 🧱 The pipeline can be scaled to handle larger datasets (e.g., 50K+ records) with minimal configuration changes

Airflow DAG Graph and Demo Video

The image below shows how the tasks are orchestrated in the Airflow DAG.

airflow_dag_image

You can watch the demo video by clicking the image below:

watch the airflow orchestrated dag demo


Resources

The following resources are available in the docs/ folder of this repository:

  • architecture_diagram.png – High-level visual of the data pipeline architecture
  • airflow_dag_image.png – Visual representation of the DAG execution flow
  • how_to_run.md – Step-by-step instructions to set up and run this project locally
  • project_demo_video_link – Link to the project demo video

You can open the docs/ folder to view all attached guides and visual assets.


πŸ” Confidential Information Notice

For security reasons, this repository does not include any real credentials or sensitive information.

The following values have been masked, replaced, or removed in the shared scripts:

  • NewsAPI Key
  • AWS S3 Bucket Name
  • AWS IAM Role ARN
  • Snowflake Account & Connection Details

If you're running this project yourself, replace these placeholders with your own values. Refer to docs/how_to_run.md for guidance.
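
One simple way to wire in your own values without hard-coding them is to read them from environment variables, as in the sketch below; the variable names are illustrative, not the ones the project uses:

```python
# Illustrative only: supply the masked values via environment variables
# (or an Airflow connection/secrets backend); these names are hypothetical.
import os

NEWS_API_KEY = os.environ["NEWS_API_KEY"]
S3_BUCKET_NAME = os.environ["NEWS_S3_BUCKET"]
AWS_IAM_ROLE_ARN = os.environ["SNOWFLAKE_S3_ROLE_ARN"]
SNOWFLAKE_ACCOUNT = os.environ["SNOWFLAKE_ACCOUNT"]
```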


πŸ‘¨β€πŸ’» Author

Varshith Chilagani
πŸ”— LinkedIn Profile
