Music Data ETL Processor

Project by Jackson Le, Joanne Lee, Kenny Nguyen.

Project Overview

This project processes, cleans, enriches, and merges Grammy nominees/winners, Billboard Hot 100 songs, and Spotify metadata to analyze songs with both commercial and critical success.

Important Files

Cleaning.py: Cleans both Kaggle datasets.
Hot100andGrammysMerging.py: Merges both Kaggle datasets.
Main.py: Enriches dataset, allows modifications.
API.py: Handles Spotify API calls & column functions.
project_data_science_systems.ipynb: Contains SQLite database setup & data analysis & visualizations.
hot100_grammys_merged.csv: Cleaned and merged dataset containing the two from Kaggle.
api_kaggle_final.csv: Final dataset containing all three sources.

How to Use

Data Wrangling:
Run Cleaning.py to clean both the Grammy & Hot100 datasets. Then run Hot100andGrammysMerging.py, this will output hot100_grammys_merged.csv.
Enrichment:
Run main.py to enrich the merged dataset with Spotify API song data, apply any column modifications, and handle CSV/JSON conversions. The final output is api_kaggle_final.csv.
Database Insertion & Analysis:
Open project_data_science_systems.ipynb to load the cleaned datasets into an SQLite database and perform further analysis and visualization.

ETL Processor (main.py) Explanation

The main.py file serves as the core of the ETL processor. Here's how to use it:

Spotify API Integration:
Running main.py will fetch audio features from Spotify based on the songs in hot100_grammys_merged.csv.
Data Modification:
You can also modify the dataset by adding new columns and values, updating existing values or deleting columns.
Format Conversion:
You can also convert the dataset between CSV and JSON, depending on your desired format.
Final Output:
The final dataset is then saved as Api_kaggle_final.csv, ready for analysis or to be inserted into a database.

Additional modification were done below the Api_kaggle_final.csv, for file that gets updated column, we saved it as updated_modify_after_added.csv. For file that didn't get a new column but gets a value changed, it is saved as modify_update_value.csv. For deleting column, we saved it as modify_delete.csv

Jupyter Notebook Explanation

The project_data_science_systems.ipynb contains the SQLite database of the final dataset.

Creating Table & Reading CSV: The SQLite Database has a table called Songs that holds information such as song (name), artist, peak position, week in charts, category, winner, duration in milliseconds, explicit, and popularity. The data that is inserted into this database is the final dataset Api_kaggle_final.csv.
Summary Ingested Data: You can get the summary of the ingested data which gives you the number of records and columns.
Summary Transformed Data: You can get the summary of the transformed data by removing duplicates and getting the number of records and columns after deduplication.

Conclusions from Data Visualization and Analysis

The project_data_science_systems.ipynb file serves as the SQLite database and visualization:

Basic Analysis:
You can see the first 5 rows of the dataset, we can see that the first 5 rows were mostly not Grammy winners. You can see the number of songs by category and here we can see that most of the songs were Record of the Year. You can see the average popularity by category. Here we can see that Best Pop Duo/Group Performance was the most popular.
Number of Charting Songs in each Grammy Category:
You can see that most of the charting songs were in the Record of the Year category.
Average Weeks on Charts: Grammy Winners vs Non-Winners:
You can see that there is not much discrepency between the Grammy Winners and Non-Winners on the average weeks their songs were on charts.
Weeks on Chart Distribution (Grammy Winners vs. Non-Winners):
You can see that Grammy Non-Winners had roughly 2 songs that had ~27 weeks on chart which was the highest. Overall, Grammy Non-Winners had more songs on charts than Grammy Winners.
Conclusion:
With the initial data exploration and analysis we’ve performed, there is no immediate strong relationship between the critical success and commercial success as seen in the average weeks on charts graph where Grammy Winners and Non-Winners are approximately equal. However, Grammy-Winning songs have a more distinct separation of distribution in the number of weeks they charted with many being less than 5 or greater than 15. On the other hand, Grammy Non-Winners have more distribution throughout all the weeks. Realistically, the Grammy categories that would directly relate to the Hot 100 charts would be genres that have been historically popular such as rap, pop, and country.

Name		Name	Last commit message	Last commit date
Latest commit History 49 Commits
.idea		.idea
API.py		API.py
Api_kaggle_final.csv		Api_kaggle_final.csv
Cleaning.py		Cleaning.py
Data Project 1_ Reflection Report.pdf		Data Project 1_ Reflection Report.pdf
Hot100andGrammysMerging.py		Hot100andGrammysMerging.py
Main.py		Main.py
README.md		README.md
grammys2025.csv		grammys2025.csv
hot100.csv		hot100.csv
hot100_grammys_merged.csv		hot100_grammys_merged.csv
json_but_back_to_csv		json_but_back_to_csv
modify_delete.csv		modify_delete.csv
modify_update_value.csv		modify_update_value.csv
originalGrammys2025.csv		originalGrammys2025.csv
originalHot100.csv		originalHot100.csv
outputjson_after_conversion.json		outputjson_after_conversion.json
project_data_science_systems.ipynb		project_data_science_systems.ipynb
updated_modify.csv		updated_modify.csv
updated_modify_after_added.csv		updated_modify_after_added.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Music Data ETL Processor

Project Overview

Important Files

How to Use

ETL Processor (main.py) Explanation

Jupyter Notebook Explanation

Conclusions from Data Visualization and Analysis

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Music Data ETL Processor

Project Overview

Important Files

How to Use

ETL Processor (main.py) Explanation

Jupyter Notebook Explanation

Conclusions from Data Visualization and Analysis

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages