uni_crawler automates the collection, validation, and export of university program data (including PEOs) from institutional websites. The goal is to make it easier for prospective students to compare offerings across schools and for analysts to work with structured, validated data.
Using Dagster, this project emphasizes a clean ETL design with clear separation of concerns:
- Extract: scrape program and PEO information from university pages
- Transform: normalize and clean text & schema
- Validate: enforce schema, types, completeness, and deduplicate
- Load (export): write tidy CSVs for analysis or future database loading
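The four stages above can be sketched as plain Python functions. This is a minimal illustration only: the real pipeline wires these stages together as Dagster assets, and the column names and sample data here are hypothetical stand-ins.

```python
import pandas as pd

def extract() -> pd.DataFrame:
    # In the real pipeline this scrapes university pages; here it is stubbed.
    return pd.DataFrame({
        "program": ["BS Computer Science ", "BS Computer Science ", None],
        "campus": ["Lucena", "Lucena", "Candelaria"],
    })

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize text columns (strip stray whitespace).
    out = df.copy()
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].str.strip()
    return out

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # Enforce completeness and uniqueness: drop empty rows and duplicates.
    return df.dropna().drop_duplicates().reset_index(drop=True)

def load(df: pd.DataFrame, path: str) -> None:
    # Write a tidy CSV for analysis or a future database load.
    df.to_csv(path, index=False)

clean = validate(transform(extract()))
```

Keeping each stage a pure function of its input makes the steps easy to test in isolation before handing orchestration to Dagster.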
For a more detailed write-up of the project background and design choices, see the full documentation.
- Two institutional sources supported initially: MSEUF and CEFI
- Robust data quality checks: missing values, type validation, uniqueness, schema completeness
- Modular Dagster assets: scraping, validation, and exporting are orchestrated independently
- Reproducible environment: Python venv and requirements.txt
- Monthly schedule (configurable): aligns with low-volatility academic updates
- Language: Python
- Web Scraping: BeautifulSoup4
- Data Manipulation: Pandas
- ETL Orchestration: Dagster
- Version Control: Git/GitHub
To set up this project locally, ensure you have:
| Requirement | Version |
|---|---|
| Python | 3.11+ |
| Git | 2.30+ |
The following is a quick step-by-step guide to running the project from the Windows Command Prompt.
```bat
REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/uni-crawler.git
git clone https://github.com/KubangPawis/uni-crawler.git

REM 2) Go into the project folder
cd /d C:\path\to\uni_crawler

REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
venv\Scripts\activate.bat

REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt

REM 5) Launch Dagster's dev UI
dagster dev
```

Initial institutional sources:
- Manuel S. Enverga University Foundation (MSEUF)
- Calayan Education Foundation Inc. (CEFI)
These sites were chosen for their structured presentation of academic program information and breadth of programs.
Below is an illustration of the project pipeline:
The image below represents the pipeline job and definition managed through Dagster:
Expected CSVs after a successful run:
- univ_programs_data.csv – 5 columns (e.g., 135 records in a reference run)
- univ_programs_peo_data.csv – 3 columns (e.g., 469 records in a reference run)
Files are written to the project’s data/output directory (check your repo paths/config).
uni_crawler is designed to follow ethical scraping practices:
- Respect robots.txt and Terms of Service
- Throttle requests (delays/backoff)
- Collect only non-personal, publicly available information
- Keep request volume reasonable
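One way to honor these practices is sketched below using only the standard library; the robots.txt rules and delay value are illustrative, not the actual crawler configuration.

```python
import time
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body; in practice this is fetched from the site root.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

def polite_fetch(url: str, delay: float = 2.0) -> bool:
    """Return True if the crawler may fetch `url`, sleeping first to throttle."""
    if not robots.can_fetch("uni_crawler", url):
        return False
    time.sleep(delay)  # fixed delay; a real crawler may add exponential backoff
    return True
```

Checking `robots.can_fetch` before every request, and sleeping between requests, keeps the crawler within both rules and reasonable volume.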
Scraped page patterns (examples):
- MSEUF: /programs/, /programs/{campus_name}/{program_name}
- CEFI: /programs/, /{department_name}
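Parsing a fetched listing page might look like the sketch below. The HTML snippet, class name, and link structure are hypothetical stand-ins, since each site's actual markup differs.

```python
from bs4 import BeautifulSoup

# A stand-in for HTML fetched from a /programs/ listing page.
html = """
<ul class="program-list">
  <li><a href="/programs/lucena/bs-computer-science">BS Computer Science</a></li>
  <li><a href="/programs/lucena/bs-nursing">BS Nursing</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
programs = [
    {"name": a.get_text(strip=True), "url": a["href"]}
    for a in soup.select("ul.program-list a")
]
```

Each new source mostly needs its own selector and URL pattern; the downstream transform and validation stages stay unchanged.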
Please review individual site terms before adding new sources.
- Add more institutions/sources
- Implement database load (PostgreSQL/MySQL)
- Add CLI flags/config file for source selection and output paths
- Publish a small API or dashboard for browsing programs/PEOs
- CI checks for linting, type checks, and data quality gates
Contributions are welcome!
- Fork the repo
- Create a feature branch: git checkout -b feat/my-change
- Commit with clear messages
- Open a Pull Request describing your change and its rationale
- Built with Dagster, BeautifulSoup4, and Pandas
- Thanks to the institutions whose publicly available pages make student research easier


