
Layoffs Data Cleaning (SQL + Pandas Project)

This project demonstrates a complete data-cleaning workflow on a real-world layoffs dataset using both MySQL and Pandas (Python). The dataset contains information on company layoffs across industries, including company names, locations, industries, funding, and layoff counts.

The goal is to transform the raw dataset (with duplicates, inconsistent formatting, and missing values) into a clean, standardized version ready for analysis or visualization.

Tools Used:

MySQL → Data cleaning & transformation (SQL queries)

MySQL Workbench → Query execution & testing

Pandas (Python) → Alternative data-cleaning pipeline

GitHub → Project versioning & sharing

Data Cleaning Steps:

1. Remove Duplicates

SQL: Used ROW_NUMBER() OVER(PARTITION BY …) to flag duplicates, then deleted the rows with row number > 1.

Pandas: Used df.drop_duplicates() to remove exact duplicate rows.
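The two approaches above can be sketched side by side in Pandas. This is a minimal, self-contained illustration on a toy frame (the column names are borrowed from the dataset but the values are made up); the cumcount() trick mimics SQL's ROW_NUMBER() OVER(PARTITION BY …).

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are illustrative).
df = pd.DataFrame({
    "company": ["Acme", "Acme", "Beta"],
    "location": ["NYC", "NYC", "SF"],
    "total_laid_off": [100, 100, 50],
})

# Pandas one-liner: drop exact duplicate rows.
deduped = df.drop_duplicates()

# ROW_NUMBER()-style alternative: number rows within each group of
# identical key columns, then keep only the first occurrence.
df["row_num"] = df.groupby(["company", "location"]).cumcount() + 1
deduped_sql_style = df[df["row_num"] == 1].drop(columns="row_num")
```

Both variants keep one row per duplicate group; the second one makes the SQL logic explicit and is handy when duplicates should be judged on a subset of columns.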

2. Standardize Data

Trimmed extra spaces from company names.

Standardized industry names (e.g., values matching "Crypto%" → "Crypto").

Corrected country names (e.g., "United States." → "United States").

Converted the date column from text to a proper DATE (SQL) / datetime (Pandas) type.

SQL Example:

UPDATE layoffs_staging SET company = TRIM(company);

UPDATE layoffs_staging SET industry = 'Crypto' WHERE industry LIKE 'Crypto%';

Pandas Example:

df["company"] = df["company"].str.strip()
df["industry"] = df["industry"].replace(r"^Crypto.*", "Crypto", regex=True)
df["country"] = df["country"].replace(r"^United States.*", "United States", regex=True)
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")

3. Handle Null or Missing Values

Replaced missing industry values using a self-join (SQL) or groupby().transform() with ffill()/bfill() (Pandas).

Deleted rows where both total_laid_off and percentage_laid_off were null.

SQL Example:

UPDATE layoffs_staging t1
JOIN layoffs_staging t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE (t1.industry IS NULL OR t1.industry = '')
  AND t2.industry IS NOT NULL;

Pandas Example:

df["industry"] = df.groupby("company")["industry"].transform(lambda x: x.ffill().bfill())
df = df.dropna(subset=["total_laid_off", "percentage_laid_off"], how="all")

4. Final Cleanup

Dropped helper columns (e.g., row_num in SQL).

Ensured consistent column types (DATE, INT, VARCHAR / Pandas dtypes).

Exported cleaned dataset to CSV.

df.to_csv("cleaned_layoffs_by_pandas.csv", index=False)
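Enforcing consistent column types before export can be sketched as follows. This is a minimal illustration on a toy frame, not the project's exact code; the column names mirror the dataset, and the nullable Int64 dtype is one reasonable choice for an integer column that may contain missing values.

```python
import pandas as pd

# Toy frame with text-typed columns (values are illustrative).
df = pd.DataFrame({
    "company": ["Acme"],
    "date": ["03/15/2023"],
    "total_laid_off": ["100"],
})

# Enforce consistent dtypes: proper datetime and a nullable integer.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y", errors="coerce")
df["total_laid_off"] = pd.to_numeric(df["total_laid_off"], errors="coerce").astype("Int64")

# Export the cleaned frame without the index column.
df.to_csv("cleaned_layoffs_by_pandas.csv", index=False)
```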

Final Output

Raw data: Contained duplicates, inconsistent text, and null values.

Cleaned data: Standardized company/industry names, formatted dates, removed duplicates, minimized nulls.

Final dataset:

Cleaned_layoffs.csv (from SQL pipeline)

cleaned_layoffs_by_pandas.csv (from Pandas pipeline)

Next Steps

Perform Exploratory Data Analysis (EDA) in Python or SQL.

Build dashboards in Tableau / Power BI / Looker Studio.

Analyze layoff trends by year, industry, country, funding stage.
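A first pass at the trend analysis above could look like this in Pandas. The frame here is a tiny made-up sample; a real run would load the cleaned CSV instead (e.g., pd.read_csv("cleaned_layoffs_by_pandas.csv", parse_dates=["date"])).

```python
import pandas as pd

# Tiny illustrative sample of the cleaned data.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-06-01", "2022-09-15", "2023-01-20"]),
    "industry": ["Crypto", "Retail", "Crypto"],
    "total_laid_off": [120, 80, 200],
})

# Total layoffs per year.
by_year = df.groupby(df["date"].dt.year)["total_laid_off"].sum()

# Total layoffs per industry, largest first.
by_industry = (
    df.groupby("industry")["total_laid_off"].sum().sort_values(ascending=False)
)
```

The same groupby pattern extends to country or funding stage once those columns are clean, which is exactly why the standardization steps above matter.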

Key Learnings

How to structure a SQL-based data-cleaning pipeline.

How to replicate the same cleaning process in Pandas.

Importance of standardization in categorical data.

Handling duplicates and null values in real-world datasets.

Author: Muhammad Akbar Khan

Master’s student in Applied Mathematics | Data Enthusiast