OLJ Market Insights

A data analysis and visualization project using the OnlineJobs.ph Job Listings dataset from Kaggle.

The project takes raw job listing data scraped from OnlineJobs.ph, then cleans, transforms, and explores it through a series of Jupyter notebooks.

Exploration Sections

Salary Landscape — Salary distribution histograms, pay-range box plots by work type, and average pay rankings by category and subcategory (in USD and PHP).
Market Demands — Top 10 and Top 50 required skills, skills broken down by work type, and skill co-occurrence heatmaps (Top 15 and Top 25).
Work Types & Trends — Job distribution by work type (donut chart), monthly job posting volume over time, and weekly hours distribution.
Pay vs. Hours — Scatter plot correlating hours per week with average monthly salary, segmented by work type and currency.
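
The skill co-occurrence heatmaps are built from a matrix that counts how often two skills appear in the same listing. The following is a minimal sketch of that counting step using only the standard library. The sample listings and the `cooccurrence_matrix` helper are hypothetical and are not taken from the notebooks.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: each inner list is the skill set of one job listing.
listings = [
    ["Excel", "Data Entry", "Email Support"],
    ["Excel", "Data Entry"],
    ["WordPress", "SEO"],
    ["SEO", "Excel", "Data Entry"],
]


def cooccurrence_matrix(skill_lists, top_n):
    """Return (skills, matrix); matrix[i][j] counts listings containing both skills."""
    # Count each skill once per listing, then rank by frequency (ties by name).
    freq = Counter(skill for skills in skill_lists for skill in set(skills))
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    top = [skill for skill, _ in ranked[:top_n]]
    index = {skill: i for i, skill in enumerate(top)}

    matrix = [[0] * len(top) for _ in top]
    for skills in skill_lists:
        present = sorted({s for s in skills if s in index}, key=index.get)
        for skill in present:
            # The diagonal holds the number of listings mentioning the skill.
            matrix[index[skill]][index[skill]] += 1
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return top, matrix


skills, matrix = cooccurrence_matrix(listings, top_n=3)
```

The resulting nested list can be passed to a heatmap function from plotly or seaborn, with `skills` as the axis labels.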

Project Structure

olj-market-insights/
├── data/
│   └── readme                   # Link to the data
│
├── notebooks/
│   ├── 1_cleaning.ipynb         # Salary filtering, fuzzy deduplication, export
│   ├── 2_transformation.ipynb   # Salary parsing, normalization, categorization
│   └── 3_exploration.ipynb      # Exploratory data analysis and visualizations
│
├── scripts/
│   ├── convert_to_csv.py        # Exports cleaned DB to CSV (extracts skills + dates)
│   └── csv_sampler.py           # Utility to draw a random sample from any CSV
│
├── utils/
│   ├── date_extract.py          # Parses "DATE UPDATED" from raw HTML job pages
│   └── skill_extract.py         # Parses "SKILL REQUIREMENT" tags from raw HTML
│
├── results/                     # Exported HTML renders of the notebooks
├── index.html                   # Standalone frontend dashboard
├── pyproject.toml
└── uv.lock

Data Pipeline

olj-jobs.db  (raw)
     │
     ▼
notebooks/1_cleaning.ipynb
  • Drop rows where salary currency cannot be identified (USD/PHP only)
  • Remove exact and near-duplicate listings (TF-IDF cosine similarity ≥ 0.80)
  • Export → olj-jobs-cleaned.db
     │
     ▼
scripts/convert_to_csv.py
  • Read olj_jobs table from cleaned DB
  • Extract skills and date_updated from raw HTML via BeautifulSoup
  • Export → data/olj-jobs-cleaned.csv
     │
     ▼
notebooks/2_transformation.ipynb
  • Parse and normalize salary strings (ranges, hourly, monthly, etc.)
  • Classify currency (USD / PHP) and salary type (hourly/monthly/weekly)
  • Add USD-normalized salary column
  • Classify job categories and subcategories
  • Export → data/transformed_jobs.csv
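
The salary parsing and USD normalization steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the input string formats, the `Salary` dataclass, the helper names, and the `PHP_PER_USD` exchange rate are all hypothetical placeholders.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

PHP_PER_USD = 56.0  # Assumption: placeholder exchange rate, not the project's value.


@dataclass
class Salary:
    currency: str  # "USD" or "PHP"
    low: float     # lower bound of the range
    high: float    # upper bound (equal to low for a single amount)
    period: str    # "hourly", "weekly", "monthly", or "unknown"


_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")
_PERIOD = re.compile(r"\b(hour|hr|week|wk|month|mo)s?\b")
_PERIOD_NAMES = {
    "hour": "hourly", "hr": "hourly",
    "week": "weekly", "wk": "weekly",
    "month": "monthly", "mo": "monthly",
}


def parse_salary(text: str) -> Optional[Salary]:
    """Parse a free-text salary; return None if currency or amount is missing."""
    lowered = text.lower()
    if "$" in lowered or "usd" in lowered:
        currency = "USD"
    elif "php" in lowered or "₱" in lowered:
        currency = "PHP"
    else:
        return None  # currency cannot be identified
    amounts = [float(m.replace(",", "")) for m in _NUMBER.findall(lowered)]
    if not amounts:
        return None
    match = _PERIOD.search(lowered)
    period = _PERIOD_NAMES[match.group(1)] if match else "unknown"
    return Salary(currency, min(amounts), max(amounts), period)


def to_usd(salary: Salary) -> Tuple[float, float]:
    """Convert the salary range to USD using the placeholder rate."""
    divisor = 1.0 if salary.currency == "USD" else PHP_PER_USD
    return salary.low / divisor, salary.high / divisor


monthly_range = parse_salary("$500 - $800 per month")
php_salary = parse_salary("PHP 28,000/month")
```

Keeping the period separate from the amount lets hourly, weekly, and monthly salaries be compared later without losing the original unit.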

Setup

This project uses uv for dependency management and requires Python 3.13+.

1. Install dependencies

uv sync

2. Activate the virtual environment

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

3. Download the dataset

Download the raw dataset from Kaggle and place the SQLite database at data/olj-jobs.db:

https://www.kaggle.com/datasets/kennethjoshuabecaro/onlinejobs-ph-job-listings

Usage

Run the full pipeline

NOTE: When I ran the pipeline, it used around 30 GB of RAM and crashed my PC. Make sure your machine can handle this. If it cannot, use Google Colab to run the notebooks.

Execute the notebooks in order using Jupyter:

notebooks/1_cleaning.ipynb       → produces data/olj-jobs-cleaned.db
notebooks/2_transformation.ipynb → produces data/transformed_jobs.csv

Following the order in Data Pipeline, run the extraction script after 1_cleaning.ipynb and before 2_transformation.ipynb:

py -m scripts.convert_to_csv

Utility scripts

Sample rows from a CSV for quick inspection:

# Sample 10 records (default)
py -m scripts.csv_sampler data/transformed_jobs.csv

# Sample 50 records with a fixed seed and a custom output file
py -m scripts.csv_sampler data/transformed_jobs.csv -n 50 -o data/sample/sample_50.csv --seed 42
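
For reference, seeded sampling of this kind can be sketched with the standard library alone. This is not the source of scripts/csv_sampler.py. The `sample_csv` function and the in-memory demo data are hypothetical and only illustrate why a fixed seed gives a reproducible sample.

```python
import csv
import io
import random


def sample_csv(source, n=10, seed=None):
    """Return the header and up to n randomly chosen rows from a CSV file object."""
    reader = csv.reader(source)
    header = next(reader)
    rows = list(reader)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return header, rng.sample(rows, min(n, len(rows)))


# Hypothetical in-memory CSV standing in for data/transformed_jobs.csv.
demo = "id,title\n" + "\n".join(f"{i},job {i}" for i in range(100)) + "\n"

header, first = sample_csv(io.StringIO(demo), n=5, seed=42)
_, second = sample_csv(io.StringIO(demo), n=5, seed=42)
```

Both calls use the same seed, so `first` and `second` contain the same rows in the same order.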

Dependencies

Package	Purpose
`pandas`	Data loading, cleaning, and transformation
`numpy`	Statistical computations and binning
`scikit-learn`	TF-IDF vectorization for fuzzy deduplication
`beautifulsoup4`	HTML parsing for skill and date extraction
`plotly`	Interactive chart rendering (exploration notebooks)
`seaborn`	Statistical visualizations (exploration notebooks)
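
As an illustration of how scikit-learn's TF-IDF vectorization can drive the fuzzy deduplication step (cosine similarity ≥ 0.80 in the Data Pipeline), here is a minimal sketch. The `drop_near_duplicates` helper and the sample texts are hypothetical. The cleaning notebook may vectorize different fields or use different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def drop_near_duplicates(texts, threshold=0.80):
    """Return indices of texts to keep, dropping later near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    similarity = cosine_similarity(tfidf)  # dense n x n matrix
    keep = []
    for i in range(len(texts)):
        # Keep a text only if it is below the threshold against every kept text.
        if all(similarity[i, j] < threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical listing descriptions; the second is an exact copy of the first.
texts = [
    "Virtual assistant needed for email and calendar management",
    "Virtual assistant needed for email and calendar management",
    "Senior WordPress developer for ecommerce site",
]
kept = drop_near_duplicates(texts)
```

The full pairwise matrix grows quadratically with the number of listings, so a sketch like this would likely need batching or a sparse approach on a large dataset.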

Dataset

The raw data was collected from OnlineJobs.ph, a platform for hiring Filipino remote workers. The Kaggle dataset used as the source is:

https://www.kaggle.com/datasets/kennethjoshuabecaro/onlinejobs-ph-job-listings
