YBIO Scraper & Analysis Pipeline

A robust data pipeline to scrape, process, and analyze data from the Yearbook of International Organizations (YBIO). This tool extracts detailed information about international organizations, handles authentication via cookies, and provides analytical insights.

👤 Author

Diwas Puri
Duke University
📧 diwas.puri@duke.edu

🚀 Features

- Automated Scraping: Iterates through thousands of pages on YBIO to extract organization data.
- Robust Error Handling: Retries failed pages and saves data in chunks to prevent loss.
- Data Cleaning: Deduplicates entries, removes artifacts, and standardizes formats.
- Analysis: Jupyter notebooks for visualizing geographic distribution, founding timelines, and organization types.
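
The retry-and-chunk idea from the feature list can be sketched as follows. This is a minimal illustration, not the code in scrape_html_table.py: the function names, the exponential backoff policy, and the `chunk_00001.csv` file naming are all assumptions made for the example.

```python
import csv
import time
from pathlib import Path
from typing import Callable


def fetch_with_retry(fetch_page: Callable[[int], list[dict]], page: int,
                     max_attempts: int = 3, base_delay: float = 1.0) -> list[dict]:
    """Call fetch_page(page), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_page(page)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait 1x, 2x, 4x ... base_delay before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def save_chunk(rows: list[dict], out_dir: Path, chunk_id: int) -> Path:
    """Write one chunk to its own CSV so a crash loses at most one chunk."""
    if not rows:
        raise ValueError("refusing to write an empty chunk")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"chunk_{chunk_id:05d}.csv"  # hypothetical naming scheme
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Passing the page fetcher in as a callable keeps the retry logic independent of how a page is actually downloaded and parsed.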

📂 Project Structure

```
├── analysis/
│   ├── analysis.ipynb       # Visualizations and insights
│   └── data_cleanup.ipynb   # Data cleaning logic
├── data/
│   ├── organizations_clean.csv  # Final cleaned dataset
│   └── raw_chunks/          # Raw scraped data chunks
├── utils/
│   ├── merge_csv.py         # Script to merge raw chunks
│   └── analyze_html_coverage.py # Checks for missing pages
├── scrape_html_table.py     # Main scraper script
├── requirements.txt         # Python dependencies
└── README.md
```

🛠️ Installation

Clone the repository:

```
git clone https://github.com/androidilicious/ybio-scraper.git
cd ybio-scraper
```

Install dependencies:
```
pip install -r requirements.txt
```

🔑 Authentication

Access to YBIO requires institutional login (e.g., via Duke University). This scraper uses browser cookies for authentication.

1. Log in to YBIO in your web browser.
2. Export your cookies to a file named cookies.pkl in the repository root.
   - Note: The cookie extraction scripts normally live in utils/, but they are excluded from the repo for security.
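
As a rough sketch of how a cookies.pkl file might be consumed, the snippet below loads the pickle and builds a `Cookie` request header. It assumes the file holds a pickled list of dicts with `name` and `value` keys (the shape a browser-automation export commonly produces); the actual format used by this project is not documented here, and `load_cookies` and `cookie_header` are hypothetical helper names.

```python
import pickle
from pathlib import Path


def load_cookies(path: str = "cookies.pkl") -> dict[str, str]:
    """Load pickled cookies and return a plain name -> value mapping.

    Assumes a list of dicts with "name" and "value" keys (an assumption,
    not a documented format). Only unpickle files you created yourself:
    pickle can execute arbitrary code on load.
    """
    with Path(path).open("rb") as f:
        raw = pickle.load(f)
    return {cookie["name"]: cookie["value"] for cookie in raw}


def cookie_header(cookies: dict[str, str]) -> str:
    """Format the mapping as the value of an HTTP Cookie header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

Because session cookies expire, a scrape that suddenly starts returning login pages usually means the file needs to be exported again.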

💻 Usage

1. Scrape Data

Run the main scraper to fetch data from the website.

```
python scrape_html_table.py --workers 5
```

Data is saved to data/raw_chunks/.
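
To illustrate what a `--workers` option typically controls, here is a small sketch that spreads page fetches over a thread pool and records which pages failed. It is not taken from scrape_html_table.py: `scrape_page` is a placeholder, and the default of 5 workers simply mirrors the command above.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_page(page: int) -> list[dict]:
    """Placeholder: a real implementation would fetch and parse one page."""
    return [{"page": page}]


def scrape_all(pages, workers: int, scrape=scrape_page):
    """Scrape pages concurrently; return (results by page, sorted failed pages)."""
    results: dict[int, list[dict]] = {}
    failed: list[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(scrape, page): page for page in pages}
        for future in as_completed(futures):
            page = futures[future]
            try:
                results[page] = future.result()
            except Exception:
                failed.append(page)  # keep going; failed pages can be retried later
    return results, sorted(failed)


def parse_args(argv=None) -> argparse.Namespace:
    """Parse a hypothetical command line exposing only --workers."""
    parser = argparse.ArgumentParser(description="Scraper CLI sketch")
    parser.add_argument("--workers", type=int, default=5,
                        help="number of concurrent page fetches")
    return parser.parse_args(argv)
```

Threads suit this workload because page fetches are I/O-bound. A higher worker count finishes sooner but also raises the request rate against the site, so it should stay within the crawl limits the site allows.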

2. Merge & Deduplicate

Combine all raw chunks into a single CSV file.

```
python utils/merge_csv.py
```
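
The merge step can be pictured with the standard-library sketch below, which concatenates every CSV in a chunk directory and drops rows that are exact duplicates. It is an illustration only: utils/merge_csv.py may use different tooling or a different duplicate key, and `merge_chunks` and its arguments are hypothetical.

```python
import csv
from pathlib import Path


def merge_chunks(chunk_dir, out_path) -> int:
    """Merge all CSVs in chunk_dir into out_path, skipping exact duplicate rows.

    The header of the first non-empty chunk defines the output columns.
    Returns the number of data rows written.
    """
    seen = set()
    fieldnames = None
    merged = []
    for chunk in sorted(Path(chunk_dir).glob("*.csv")):
        with chunk.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames
            for row in reader:
                # A row is a duplicate if every column value matches a row already kept.
                key = tuple(row.get(name, "") for name in fieldnames)
                if key not in seen:
                    seen.add(key)
                    merged.append(row)
    if fieldnames is None:
        return 0  # no chunks with a header were found
    with Path(out_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(merged)
    return len(merged)
```

Exact-match deduplication is the most conservative choice. Near-duplicates, such as the same organization with differing whitespace, are left for the cleaning step.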

3. Clean Data

Run the cleanup notebook or script to remove duplicates and fix formatting.

Open analysis/data_cleanup.ipynb and run all cells.

4. Analyze

Explore the dataset using the analysis notebook.

Open analysis/analysis.ipynb to see charts and statistics.

📊 Dataset Overview

The final dataset includes:

- Name: Organization name
- Acronym: Abbreviation
- Founded: Year of establishment
- Location: City and country
- Type: Classification (Type I/II)
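
For a quick look at these fields outside the notebooks, something like the following could work. It assumes the CSV headers match the field names listed above, that Founded holds a four-digit year, and that Location is formatted as "City, Country". None of these details are confirmed here, and the helper names are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path: str = "data/organizations_clean.csv") -> list[dict]:
    """Read the cleaned dataset into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def founding_decades(rows) -> Counter:
    """Count organizations per founding decade, skipping missing or malformed years."""
    counts = Counter()
    for row in rows:
        year = (row.get("Founded") or "").strip()
        if year.isdigit() and len(year) == 4:
            counts[int(year) // 10 * 10] += 1
    return counts


def country_counts(rows) -> Counter:
    """Count organizations per country, taking the text after the last comma in Location."""
    counts = Counter()
    for row in rows:
        location = (row.get("Location") or "").strip()
        if location:
            counts[location.rsplit(",", 1)[-1].strip()] += 1
    return counts
```

Calling `founding_decades(load_rows()).most_common(5)` would then list the five busiest founding decades.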

🗺️ Interactive Maps

Explore the global distribution of international organizations:

- Organization Map: Interactive map with country-level markers showing organization counts (189 countries, 47K+ organizations)
- Heatmap: Density heatmap of organization concentration worldwide

Click the links above to interact with the maps - zoom, pan, and click markers for details!

Disclaimer: This tool is for educational and research purposes only. Please respect the website's terms of service and crawl rate limits.
