Welcome! This is my very first professional-grade project in the field of Data Analytics.
It represents my initial milestone in mastering the data science pipeline, where I focused intensely on the most critical phase: Data Cleaning and Preprocessing.
In real-world analytics, the quality of insights depends entirely on the quality of the data. This project demonstrates a systematic approach to handling missing values, outliers, and complex string manipulations to prepare data for future predictive modeling.
- Language: Python
- Primary Library: Pandas (Data Manipulation)
- Supporting Libraries: NumPy, Seaborn, Matplotlib (Visualization of Data Quality)
- Target Cleaning: Identified and removed records with null values in the `price` column to ensure dataset reliability.
- Handling Missing Values:
  - Performed a row-wise and column-wise null analysis.
  - Dropped columns with a high percentage of missing values (exceeding 50-60%) that couldn't provide meaningful signals.
  - Filtered out rows with critical missing information to reduce noise and improve data density.
- Redundancy Removal: Dropped irrelevant high-cardinality columns (URLs, IDs, scrape dates) to optimize memory usage.
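
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a toy table, not the project's actual script: the sample values, the `basic_clean` helper, its parameter names, and the 0.5 cutoff are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw listings table (hypothetical values).
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "listing_url": ["u1", "u2", "u3", "u4"],
    "price": [120.0, np.nan, 85.0, 200.0],
    "license": [np.nan, np.nan, np.nan, "ABC"],  # mostly missing
    "bedrooms": [1.0, 2.0, np.nan, 3.0],         # occasionally missing
})

# Column-wise and row-wise null analysis.
null_share_by_column = raw.isna().mean()
null_count_by_row = raw.isna().sum(axis=1)


def basic_clean(df, target="price", max_missing=0.5,
                drop_cols=("id", "listing_url")):
    """Hypothetical helper: drop null targets, sparse columns, and ID-like columns."""
    # 1. Remove records that have no value in the target column.
    out = df.dropna(subset=[target])
    # 2. Drop columns whose share of missing values exceeds the cutoff.
    missing_share = out.isna().mean()
    sparse_cols = missing_share[missing_share > max_missing].index
    out = out.drop(columns=sparse_cols)
    # 3. Drop identifier-like columns that carry no modeling signal.
    out = out.drop(columns=[c for c in drop_cols if c in out.columns])
    return out.reset_index(drop=True)


cleaned = basic_clean(raw)
print(cleaned)
```

Computing the missing share after the target rows are removed means the cutoff reflects the rows that are actually kept; whether the original analysis ordered the steps this way is not stated.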
- Numerical Features: Applied median imputation to handle skewed distributions and outliers (e.g., `host_listings_count`).
- Categorical Features: Logically filled missing boolean indicators (e.g., `host_is_superhost`).
- Percentage Fields: Processed string-based percentages into numerical values and handled "Unknown" categories for missing response data.
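
A rough sketch of these three imputation steps is shown below. The sample data is invented, the `host_response_rate` column and the two derived column names are assumptions, and filling a missing superhost flag with `'f'` is only one plausible reading of "logically filled".

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
df = pd.DataFrame({
    "host_listings_count": [1.0, 2.0, np.nan, 50.0],
    "host_is_superhost": ["t", None, "f", "t"],
    "host_response_rate": ["95%", None, "100%", "80%"],
})

# Median imputation: the median is not pulled upward by the outlier (50).
median_count = df["host_listings_count"].median()
df["host_listings_count"] = df["host_listings_count"].fillna(median_count)

# Assumption: a missing superhost flag is treated as "not a superhost".
df["host_is_superhost"] = df["host_is_superhost"].fillna("f")

# Strip the percent sign and convert to a number; missing entries stay NaN.
rate = pd.to_numeric(df["host_response_rate"].str.rstrip("%"), errors="coerce")
df["host_response_rate_pct"] = rate

# Keep an explicit marker for rows where no response data was available.
df["host_response_status"] = np.where(rate.isna(), "Unknown", "Known")

print(df)
```

With a mean instead of the median, the single host with 50 listings would dominate the fill value, which is why the median suits skewed counts like this one.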
- Text Feature Extraction: Created binary keyword indicators from listing descriptions (e.g., "spacious", "beach", "luxury").
- Temporal Features: Engineered host seniority (years active) and recency of reviews from date fields.
- Categorical Binning: Grouped continuous variables (like response rates) into discrete bins to simplify the feature space and improve model interpretability.
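
The feature engineering steps above might look something like the following. Treat it as a sketch under stated assumptions: the column names (`description`, `host_since`, `last_review`, `host_response_rate_pct`), the reference date, the bin edges, and the bin labels are illustrative and not taken from the project.

```python
import pandas as pd

# Toy data (hypothetical values and column names).
df = pd.DataFrame({
    "description": ["Spacious loft near the beach", "Cozy studio", None],
    "host_since": ["2015-06-01", "2020-01-15", "2018-03-10"],
    "last_review": ["2023-12-01", "2023-06-30", None],
    "host_response_rate_pct": [100.0, 72.0, None],
})

# Assumed snapshot date used as "today" for the temporal features.
reference_date = pd.Timestamp("2024-01-01")

# Text features: one binary indicator per keyword, case-insensitive.
for keyword in ("spacious", "beach", "luxury"):
    df[f"has_{keyword}"] = (
        df["description"]
        .str.contains(keyword, case=False, na=False, regex=False)
        .astype(int)
    )

# Temporal features: host seniority in years and days since the last review.
host_since = pd.to_datetime(df["host_since"])
last_review = pd.to_datetime(df["last_review"])
df["host_years_active"] = (reference_date - host_since).dt.days / 365.25
df["days_since_last_review"] = (reference_date - last_review).dt.days

# Binning: group the response rate into three assumed bands plus "Unknown".
bins = pd.cut(
    df["host_response_rate_pct"],
    bins=[0, 50, 90, 100],
    labels=["low", "medium", "high"],
    include_lowest=True,
)
df["response_rate_bin"] = bins.cat.add_categories("Unknown").fillna("Unknown")

print(df.drop(columns=["description"]))
```

Fixing `reference_date` rather than calling the current time keeps the derived features reproducible between runs.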
- Standardized boolean values 't'/'f' into binary numeric format.
- Used histograms to verify that the cleaning process did not introduce biases in the review score distributions.
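
As a small illustration of the boolean standardization, the snippet below maps `'t'`/`'f'` strings to 1/0. The second column name and the `int8` target type are assumptions; the snippet also assumes missing flags were filled beforehand, since the integer cast would fail on leftover nulls.

```python
import pandas as pd

# Toy data (hypothetical values; "instant_bookable" is an assumed column).
df = pd.DataFrame({
    "host_is_superhost": ["t", "f", "f"],
    "instant_bookable": ["f", "t", "t"],
})

flag_map = {"t": 1, "f": 0}
bool_cols = ["host_is_superhost", "instant_bookable"]

# Convert each flag column to a compact integer type.
for col in bool_cols:
    df[col] = df[col].map(flag_map).astype("int8")

print(df.dtypes)
```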
The output of this project is a Cleaned Dataset with:
- Zero missing values in critical feature columns.
- 10+ new engineered features extracted from raw text and dates.
- Optimized data types for efficient modeling and analysis.

This cleaned dataset is ready for applications such as price prediction, demand forecasting, and host performance analysis.
- Clone the repo:
git clone https://github.com/qtracie/airbnb-data.git
- Install dependencies:
pip install -r requirements.txt
- Execute the script:
python notebooks/airbnb-analysis.py