A collection of specialized data scrapers for gathering training datasets from various sources.
This project contains tools to efficiently scrape and process data from popular platforms and repositories. Each scraper is designed to extract high-quality, structured data suitable for fine-tuning machine learning models.
github_repo_scrapper:
Scrapes code review comments from high-quality GitHub repositories. It collects review feedback and code context from merged pull requests to build a dataset for fine-tuning LLMs on code review practices.
Key Features:
- Fetches merged PRs from curated Python repositories
- Extracts inline review comments with code diff context
- Filters low-quality comments (e.g., "LGTM", too short)
- Outputs structured JSON data ready for model training
- Rate-limited API calls to respect GitHub limits
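The comment-filtering step above can be sketched as a simple predicate. This is an illustrative example only: the phrase list, the `MIN_LENGTH` threshold, and the function name `is_quality_comment` are assumptions, not the scraper's exact rules.

```python
# Hypothetical filter mirroring the "low-quality comment" rules described
# above; phrases and thresholds are illustrative, not the scraper's own.
BOILERPLATE = {"lgtm", "looks good to me", "+1", "nit", "done", "thanks"}
MIN_LENGTH = 20  # assumed minimum useful comment length, in characters


def is_quality_comment(body: str) -> bool:
    """Return True if a review comment carries enough signal to keep."""
    text = body.strip()
    if len(text) < MIN_LENGTH:
        return False  # too short to be actionable feedback
    if text.lower().rstrip("!.") in BOILERPLATE:
        return False  # pure approval/boilerplate, no review content
    return True
```

For example, `is_quality_comment("LGTM!")` returns `False`, while a substantive inline suggestion passes the filter.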
Getting Started:
cd github_repo_scrapper
pip install -r requirements.txt
cp .env.example .env
# Add your GitHub token to .env
python scraper.py

Output: data/all_examples.json containing ~5-10k code review examples
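Once the run finishes, the output file can be loaded with the standard library. A minimal sketch, assuming the top-level JSON value is an array of example records (the exact field names depend on the scraper version; `load_examples` is a hypothetical helper, not part of the project):

```python
import json


def load_examples(path="data/all_examples.json"):
    """Load the scraper's JSON output.

    Assumes the file holds a JSON array of example records; the exact
    per-record fields depend on the scraper version.
    """
    with open(path, encoding="utf-8") as f:
        examples = json.load(f)
    if not isinstance(examples, list):
        raise ValueError(f"expected a JSON array in {path}")
    return examples
```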
Requirements:
- Python 3.8+
- Individual scraper requirements listed in the respective folders
Project Structure:
DataScrappers/
├── README.md (this file)
└── github_repo_scrapper/
    ├── README.md
    ├── scraper.py
    ├── requirements.txt
    ├── .env.example
    ├── .gitignore
    └── data/
Planned Scrapers:
- Documentation scrapers
- Blog/article extractors
- Dataset collectors from public APIs
License:
See the individual scraper directories for license information.