-
Couldn't load subscription status.
- Fork 0
This is a folder project that includes a comprehensive Data Engineering portfolio. In this page you can explore database creation, managed analysis, visualization and targeted manipulation.
Couldn't load subscription status.
aldoMsc/Data-Engineering-and-Analysis-City-Air-Quality-API-Extraction-Portfolio-Full-Project
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
Air Quality Bristol API ETL & Visualization Project A Python-based ETL pipeline that extracts air quality monitoring data from the official Bristol City Council API, processes and transforms the data, and loads the results into a PostgreSQL database using Apache Spark. The project also includes data normalization, geospatial coordinate handling, and visualization of monitoring locations across the city of Bristol. Project Overview This project demonstrates a complete data engineering and analytics workflow: Extract: Data is collected from an online public API in GeoJSON format. Transform: Geospatial data is processed using GeoPandas. Coordinates are normalized using MinMaxScaler. Cleaned and transformed data is stored in both CSV and PostgreSQL database formats. Load: The processed data is loaded into a PostgreSQL database via Apache Spark. Visualize: Monitor site locations are visualized using Matplotlib on a normalized coordinate grid. Technologies Used Python 3.x GeoPandas Pandas Scikit-learn Matplotlib Apache Spark (PySpark) PostgreSQL JDBC Connector Shapely Data Source Source: Bristol City Council Air Quality API Format: GeoJSON containing monitoring station data for Bristol City. Project Workflow API Call: Fetches live air quality data in GeoJSON format. Read & Parse GeoJSON: Loads data into a GeoPandas DataFrame. Coordinate Normalization: Scales longitude and latitude values between 0 and 1. Data Transformation: Adds normalized coordinates. Updates geometry points. Selects and renames relevant columns. Export to CSV: Saves the transformed dataset. Load into PostgreSQL via Apache Spark: Ingests final data into a database table. Visualization: Plots normalized monitoring locations on a scatter plot. Summary Statistics: Counts unique monitoring locations in the dataset. Example Visualization A scatter plot showing the distribution of air quality monitoring stations in Bristol, using normalized coordinate values. How to Run Install required Python libraries: bash Copy Edit pip install pandas geopandas scikit-learn matplotlib pyspark psycopg2-binary Set up your PostgreSQL database and update the connection credentials in the script. Run the Python script to execute the complete ETL pipeline and generate the visual output. Key Skills Demonstrated API Integration & Data Extraction GeoJSON Data Handling Geospatial Data Normalization Apache Spark Data Processing PostgreSQL Data Ingestion via JDBC Data Visualization End-to-End ETL Workflow
About
This is a folder project that includes a comprehensive Data Engineering portfolio. In this page you can explore database creation, managed analysis, visualization and targeted manipulation.
Topics
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published