
Uni_crawler is a Python + Dagster pipeline that scrapes and cleans academic program data from universities, ensuring structured, high-quality outputs for easier research and analysis.


uni_crawler banner

uni_crawler

📖 Overview

uni_crawler automates the collection, validation, and export of university program data, including Program Educational Objectives (PEOs), from institutional websites. The goal is to make it easier for prospective students to compare offerings across schools and for analysts to work with structured, validated data.

Built on Dagster, the project follows a clean ETL design with clear separation of concerns:

  • Extract: scrape program and PEO information from university pages
  • Transform: normalize and clean text & schema
  • Validate: enforce schema, types, completeness, and deduplicate
  • Load (export): write tidy CSVs for analysis or future database loading
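As a minimal sketch of the Transform and Validate stages (the column names and cleaning rules here are illustrative, not the repo's actual code), the pandas logic might look like:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize text: trim and collapse internal whitespace."""
    df = raw.copy()
    df["program"] = df["program"].str.strip().str.replace(r"\s+", " ", regex=True)
    return df

def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Enforce completeness on required columns, then deduplicate."""
    required = ["university", "program"]
    df = df.dropna(subset=required)
    return df.drop_duplicates(subset=required).reset_index(drop=True)

raw = pd.DataFrame({
    "university": ["MSEUF", "MSEUF", "CEFI"],
    "program": ["  BS Computer   Science", "BS Computer Science", None],
})
# Messy duplicate collapses and the incomplete row is dropped
tidy = validate(transform(raw))
```

In the pipeline itself, each stage is a separate Dagster asset, so failures and data quality issues surface per stage in the Dagster UI.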

For a more detailed write-up of the project background and design choices, see the full documentation.

✨ Key Features

  • Two institutional sources supported initially: MSEUF and CEFI
  • Robust data quality checks: missing values, type validation, uniqueness, schema completeness
  • Modular Dagster assets: scraping, validation, and exporting are orchestrated independently
  • Reproducible environment: Python venv and requirements.txt
  • Monthly schedule (configurable): aligns with low-volatility academic updates

🛠️ Tech Stack

  • Language: Python
  • Web Scraping: BeautifulSoup4
  • Data Manipulation: Pandas
  • ETL Orchestration: Dagster
  • Version Control: Git/GitHub

👋 Prerequisites

To set up this project locally, ensure you have:

| Requirement | Version |
| ----------- | ------- |
| Python      | 3.11+   |
| Git         | 2.30+   |

🏁 Getting Started

The following is a quick step-by-step guide to running the project from the Windows Command Prompt.

REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/uni-crawler.git
git clone https://github.com/KubangPawis/uni-crawler.git

REM 2) Go into the project folder
cd /d C:\path\to\uni_crawler

REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
venv\Scripts\activate.bat

REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt

REM 5) Launch Dagster's dev UI
dagster dev

🪙 Data Sources

Initial institutional sources:

  • Manuel S. Enverga University Foundation (MSEUF)
  • Calayan Education Foundation Inc. (CEFI)

These sites were chosen for their structured presentation of academic program information and breadth of programs.

⚙️ Pipeline

Below is an illustration of the project pipeline:

uni_crawler - Pipeline

🔍 Pipeline Definition (Dagster)

The image below represents the pipeline job and definition managed through Dagster:

uni_crawler - Dagster Pipeline Definition

🎯 Outputs

Expected CSVs after a successful run:

  • univ_programs_data.csv – 5 columns (e.g., 135 records in a reference run)

  • univ_programs_peo_data.csv – 3 columns (e.g., 469 records in a reference run)

Files are written to the project’s data/output directory (check your repo paths/config).
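Assuming the two CSVs share a program key (the actual exported schemas may differ, so check the headers first), a typical downstream analysis joins each program to its PEOs:

```python
import pandas as pd

# Hypothetical column names and values for illustration only.
programs = pd.DataFrame({
    "program_id": [1, 2],
    "program": ["BS Computer Science", "BS Nursing"],
})
peos = pd.DataFrame({
    "program_id": [1, 1, 2],
    "peo": ["Apply computing theory", "Act ethically", "Deliver safe care"],
})

# One row per (program, PEO) pair, as in univ_programs_peo_data.csv
merged = programs.merge(peos, on="program_id", how="left")
```

A left join keeps programs that have no PEOs yet, which makes missing PEO coverage easy to spot.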

✅ Ethical Web Scraping

uni_crawler is designed to follow ethical scraping practices:

  • Respect robots.txt and Terms of Service
  • Throttle requests (delays/backoff)
  • Collect only non-personal, publicly available information
  • Keep request volume reasonable
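The robots.txt check and backoff throttling above can be sketched with the standard library alone (the user-agent string and backoff parameters are assumptions, not the crawler's actual settings):

```python
import urllib.robotparser

def fetch_allowed(robots_txt: str, url: str, agent: str = "uni_crawler") -> bool:
    """Parse a robots.txt body and check whether `agent` may fetch `url`."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

def backoff_delays(base: float = 1.0, retries: int = 3) -> list[float]:
    """Exponential backoff between retries: 1s, 2s, 4s with the defaults."""
    return [base * (2 ** i) for i in range(retries)]

robots = "User-agent: *\nDisallow: /admin/\n"
```

In practice the crawler would call `fetch_allowed` before every request and sleep for the next `backoff_delays` value whenever a request fails.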

Scraped page patterns (examples):

  • MSEUF: /programs/, /programs/{campus_name}/{program_name}
  • CEFI: /programs/, /{department_name}
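As an illustration of scraping one of these listing pages with BeautifulSoup4 (the HTML below is invented; the real markup on either site will differ):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a program listing page.
html = """
<ul class="programs">
  <li><a href="/programs/main/bs-computer-science">BS Computer Science</a></li>
  <li><a href="/programs/main/bs-nursing">BS Nursing</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
# Extract (name, relative URL) pairs for each program link
programs = [(a.get_text(strip=True), a["href"])
            for a in soup.select("ul.programs a")]
```

Each relative URL would then be joined against the site's base URL to follow the per-program pattern shown above.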

Please review individual site terms before adding new sources.

🛣️ Roadmap

  • Add more institutions/sources
  • Implement database load (PostgreSQL/MySQL)
  • Add CLI flags/config file for source selection and output paths
  • Publish a small API or dashboard for browsing programs/PEOs
  • CI checks for linting, type checks, and data quality gates

💁 Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-change
  3. Commit with clear messages
  4. Open a Pull Request describing your change and its rationale

👍 Acknowledgments

  • Built with Dagster, BeautifulSoup4, and Pandas
  • Thanks to the institutions whose publicly available pages make student research easier
