Skip to content

My very first Data Analytics project: A deep dive into Airbnb data cleaning and feature engineering. Focused on building a solid, professional foundation in Python & Big Data preprocessing.

Notifications You must be signed in to change notification settings

qtracie/airbnb-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

24 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Python Version Completed

Airbnb Data Cleaning & Preprocessing Project πŸ§ΉπŸ“Š

Welcome! This is my very first professional-grade project in the field of Data Analytics.

It represents my initial milestone in mastering the data science pipeline, where I focused intensely on the most critical phase: Data Cleaning and Preprocessing.

πŸ“Œ Project Objective

In real-world analytics, the quality of insights depends entirely on the quality of the data. This project demonstrates a systematic approach to handling missing values, outliers, and complex string manipulations to prepare data for future predictive modeling.

πŸ› οΈ Tech Stack

  • Language: Python
  • Primary Library: Pandas (Data Manipulation)
  • Supporting Libraries: NumPy, Seaborn, Matplotlib (Visualization of Data Quality)

πŸš€ Data Cleaning & Preprocessing Workflow

1. Initial Data Audit & Target Refinement

  • Target Cleaning: Identified and removed records with null values in the price column to ensure dataset reliability.
  • Handling Missing Values: - Performed a row-wise and column-wise null analysis.
    • Dropped columns with a high percentage of missing values (exceeding 50-60%) that couldn't provide meaningful signals.
    • Filtered out rows with critical missing information to reduce noise and improve data density.
  • Redundancy Removal: Dropped irrelevant high-cardinality columns (URLs, IDs, scrape dates) to optimize memory usage.

2. Advanced Imputation Strategies

  • Numerical Features: Applied median imputation to handle skewed distributions and outliers(e.g., host_listings_count).
  • Categorical Features: Logically filled missing boolean indicators (e.g., host_is_superhost).
  • Percentage Fields: Processed string-based percentages into numerical values and handled "Unknown" categories for missing response data.

3. Feature Engineering (Value Extraction)

  • Text Feature Extraction: Created binary keyword indicators from listing descriptions (e.g., "spacious", "beach", "luxury").
  • Temporal Features: Engineered host seniority (years active) and recency of reviews from date fields.
  • Categorical Binning: Grouped continuous variables (like response rates) into discrete bins to simplify the feature space and improve model interpretability.

4. Data Distribution & Consistency

  • Standardized boolean values 't'/'f' into binary numeric format.
  • Used Histograms to verify that the cleaning process did not introduce biases in the review score distributions.

πŸ“Š Final Result

The output of this project is a Cleaned Dataset with:

  • Zero missing values in critical feature columns.
  • 10+ new engineered features extracted from raw text and dates.
  • Optimized data types for efficient modeling and analysis. This cleaned dataset is ready for applications such as price prediction, demand forecasting, and host performance analysis.

πŸ—οΈ How to Use

  1. Clone the repo:
    git clone [https://github.com/qtracie/airbnb-data.git](https://github.com/qtracie/airbnb-data.git)
  2. Install dependencies:
    pip install -r requirements.txt
  3. Execute the script:
    python notebooks/airbnb-analysis.py

About

My very first Data Analytics project: A deep dive into Airbnb data cleaning and feature engineering. Focused on building a solid, professional foundation in Python & Big Data preprocessing.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages