Collect, analyze, and visualize GitHub repository data using 3 specialized databases and interactive dashboards.
RepoBox collects GitHub repositories and stores them in 3 different databases:
- Memgraph (Graph DB) - For exploring connections between repos, users, and languages
- MongoDB (Document DB) - For storing complete repository data
- Dragonfly (Cache) - For ultra-fast data access
Then it provides:
- FastAPI - REST API to access the data
- Apache Superset - Beautiful dashboards and charts
- Analyze repositories across multiple programming languages
- Explore developer networks and connections
- Visualize trends with interactive dashboards
- Lightning-fast queries with smart caching
- Graph relationships between repos, users, topics, and frameworks
You only need 2 things:
- Docker (with Docker Compose)
- GitHub Token (free, takes 1 minute to create)
- Go to https://github.com/settings/tokens
- Click "Generate new token (classic)"
- Give it a name like "RepoBox"
- Select scopes: `public_repo` and `read:user`
- Click "Generate token"
- Copy the token (it starts with `ghp_...`)
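Before wiring the token into RepoBox, you can sanity-check it against the GitHub API. A minimal sketch, assuming the token is exported as `GITHUB_TOKEN` and the `requests` package is installed:

```python
import os

import requests

# The collector reads GITHUB_TOKEN from .env; here we take it from the environment.
token = os.environ["GITHUB_TOKEN"]

# GET /user returns the authenticated account when the token is valid.
resp = requests.get(
    "https://api.github.com/user",
    headers={"Authorization": f"token {token}"},
    timeout=10,
)
resp.raise_for_status()
print("Authenticated as:", resp.json()["login"])
print("API calls left this hour:", resp.headers.get("X-RateLimit-Remaining"))
```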
```bash
# Clone the repository
git clone https://github.com/YOUR_USERNAME/repobox.git
cd repobox
```
```bash
# Edit .env and paste your GitHub token
nano .env   # or use any text editor
```

Edit the `.env` file and update your token:

```
GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
REPOS_PER_LANGUAGE=10
LANGUAGES=python,javascript,typescript,go,rust,java
```

```bash
# Start all services (databases, API, dashboards)
docker compose up -d

# Wait 30 seconds for everything to start

# Check if everything is running
docker compose ps
```

You should see 6 services running (a quick connection check is sketched after this list):
- repo-memgraph
- repo-mongodb
- repo-dragonfly
- github-backend
- repobox-superset
- repo-postgres
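To confirm the databases are actually reachable (not just that the containers are listed), you can ping each one from Python; the repo also ships `collector/test_connections.py` for this purpose. A minimal sketch, assuming the default ports 27017 (MongoDB), 6379 (Dragonfly), and 7687 (Memgraph) are the ones published in `docker-compose.yaml`:

```python
from neo4j import GraphDatabase
from pymongo import MongoClient
from redis import Redis

# MongoDB: a ping on the admin database proves the server answers.
MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=3000).admin.command("ping")
print("MongoDB OK")

# Dragonfly implements the Redis protocol, so a plain Redis client works.
Redis(host="localhost", port=6379, socket_timeout=3).ping()
print("Dragonfly OK")

# Memgraph speaks Bolt, so the standard neo4j driver can connect (no credentials by default).
with GraphDatabase.driver("bolt://localhost:7687", auth=("", "")) as driver:
    driver.verify_connectivity()
print("Memgraph OK")
```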
```bash
# Install Python dependencies
pip install -r collector/requirements.txt

# Initialize databases
cd collector
python init_databases.py

# Collect GitHub repositories (takes ~2 minutes)
python collect_repos.py

# Aggregate the data
python aggregate_data.py
```

That's it!
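Under the hood, `collect_repos.py` asks the GitHub Search API for the most-starred repositories in each configured language. A simplified sketch of that step (the real script also gathers owners, topics, frameworks, and dependencies and writes to all three databases, so treat this as an outline rather than its actual code):

```python
import os

import requests

token = os.environ["GITHUB_TOKEN"]
languages = os.environ.get("LANGUAGES", "python,javascript").split(",")
per_language = int(os.environ.get("REPOS_PER_LANGUAGE", "10"))

for language in languages:
    # GitHub's search endpoint, sorted by stars, limited to REPOS_PER_LANGUAGE results.
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": f"language:{language}",
            "sort": "stars",
            "order": "desc",
            "per_page": per_language,
        },
        headers={"Authorization": f"token {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        print(language, repo["full_name"], repo["stargazers_count"])
```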
| Service | URL | Login |
|---|---|---|
| Superset Dashboards | http://localhost:8088 | admin / admin |
| FastAPI Docs | http://localhost:5000/docs | No login |
| Memgraph Lab | http://localhost:3002 | No login |
| MongoDB UI | http://localhost:8083 | No login |
For each programming language (Python, JavaScript, etc.), the system collects:
- Top repositories by stars
- Repository details (name, description, stars, forks)
- Owner information (users and organizations)
- Programming languages used
- Topics and frameworks
- Dependencies
Example: If you set REPOS_PER_LANGUAGE=10 and LANGUAGES=python,javascript, you'll get 20 repositories total.
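For a sense of what lands in MongoDB, each collected repository becomes one document roughly shaped like the sketch below; the field names are illustrative assumptions based on the list above, not the collector's exact schema:

```python
# Illustrative document shape -- field names are assumptions, not the real schema.
example_repo = {
    "name": "django",
    "full_name": "django/django",
    "description": "The Web framework for perfectionists with deadlines.",
    "stars": 80000,
    "forks": 31000,
    "owner": {"login": "django", "type": "Organization"},
    "language": "Python",
    "topics": ["python", "web", "framework"],
    "frameworks": ["Django"],
    "dependencies": ["asgiref", "sqlparse"],
}
```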
- Open http://localhost:8088
- Login: `admin` / `admin`
- Create charts and dashboards
- Connect to PostgreSQL or MongoDB
```bash
# Get language statistics
curl http://localhost:5000/metrics/languages

# Get location data
curl http://localhost:5000/metrics/locations/map

# Check cache stats
curl http://localhost:5000/cache/stats
```

- Open http://localhost:3002 (Memgraph Lab)
- Run Cypher queries:
```cypher
// Find all Python repositories
MATCH (r:Repository)-[:USES]->(l:Language {name: 'Python'})
RETURN r.name, r.stars
ORDER BY r.stars DESC
LIMIT 10

// Find repos using Django framework
MATCH (r:Repository)-[:USES_FRAMEWORK]->(f:Framework {name: 'Django'})
RETURN r.name, r.stars

// Find most popular topics
MATCH (r:Repository)-[:HAS_TOPIC]->(t:Topic)
RETURN t.name, count(r) AS repos
ORDER BY repos DESC
LIMIT 10
```

Edit `.env` to customize:
```
# How many repos per language?
REPOS_PER_LANGUAGE=10

# Which languages to collect?
LANGUAGES=python,javascript,typescript,go,rust,java,cpp,csharp

# Filter by country (optional)
FILTER_BY_COUNTRY=Tunisia
# Leave empty for global collection
```

```bash
# Stop all services
docker compose down

# Stop and delete all data (fresh start)
docker compose down -v
```

```
repobox/
├── docker-compose.yaml      # All services configuration
├── .env                     # Your settings (GitHub token, etc.)
│
├── collector/               # Data collection scripts
│   ├── collect_repos.py     # Main collector
│   ├── aggregate_data.py    # Data aggregation
│   ├── init_databases.py    # Database setup
│   └── requirements.txt     # Python dependencies
│
├── backend/                 # FastAPI REST API
│   ├── api.py               # API endpoints
│   └── requirements.txt     # API dependencies
│
└── superset/                # Superset configuration
    ├── Dockerfile
    └── superset_config.py
```

```bash
# Check Docker is running
docker --version
# View logs
docker compose logs -f
```

```bash
# Check your GitHub token is set
cat .env | grep GITHUB_TOKEN
# Test database connections
cd collector
python test_connections.py
```

Edit `docker-compose.yaml` and change the port numbers:

```yaml
ports:
  - "5001:5000"  # Change 5000 to 5001
```

RepoBox runs four data stores:

- Memgraph - Graph database with nodes and relationships
- MongoDB - Document database with flexible JSON storage
- Dragonfly - Redis-compatible in-memory cache (its benchmarks claim up to 25x the throughput of Redis); a usage sketch follows this list
- PostgreSQL - SQL database for Superset
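Because Dragonfly implements the Redis protocol, the cache can be inspected (or reused) with the standard redis-py client. A sketch, where the port and the key name are assumptions rather than the backend's actual scheme:

```python
import json

from redis import Redis

# Dragonfly is Redis-protocol compatible, so redis-py connects to it directly (port assumed).
cache = Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical key -- the FastAPI backend's real cache keys may be named differently.
key = "metrics:languages"
cached = cache.get(key)
if cached is None:
    # On a miss, compute the value (e.g. from MongoDB) and cache it with a 5-minute TTL.
    cache.set(key, json.dumps({"python": 10, "javascript": 10}), ex=300)
    print("cache miss, value stored")
else:
    print("cache hit:", json.loads(cached))
```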
The FastAPI backend exposes these endpoints:

- `/` - API status
- `/metrics/languages` - Language statistics
- `/metrics/locations/map` - World map data
- `/locations/{location}/repos` - Repos by location
- `/cache/stats` - Cache performance
- `/cache/clear` - Clear cache
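The same endpoints are easy to hit from Python instead of curl; a sketch, assuming the API is published on `localhost:5000` as in the access table above:

```python
import requests

BASE = "http://localhost:5000"

# Language statistics from /metrics/languages.
print(requests.get(f"{BASE}/metrics/languages", timeout=10).json())

# Repositories for one location (path parameter as in /locations/{location}/repos).
print(requests.get(f"{BASE}/locations/Tunisia/repos", timeout=10).json())

# Cache performance counters.
print(requests.get(f"{BASE}/cache/stats", timeout=10).json())
```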
The Memgraph graph model has:

- 9 node types: Repository, User, Organization, Language, Framework, Topic, Dependency, Contributor, City
- 7 relationship types: OWNED_BY, USES, HAS_TOPIC, DEPENDS_ON, USES_FRAMEWORK, CONTRIBUTES_TO, HAS_CONTRIBUTOR
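Since Memgraph exposes the Bolt protocol, this graph can also be queried from Python with the neo4j driver (the Bolt port 7687 is Memgraph's default and an assumption about what `docker-compose.yaml` publishes):

```python
from neo4j import GraphDatabase

# The standard neo4j driver works against Memgraph's Bolt endpoint.
with GraphDatabase.driver("bolt://localhost:7687", auth=("", "")) as driver:
    records, _, _ = driver.execute_query(
        """
        MATCH (r:Repository)-[:USES]->(:Language {name: $language})
        RETURN r.name AS name, r.stars AS stars
        ORDER BY stars DESC
        LIMIT 5
        """,
        language="Python",
    )
    for record in records:
        print(record["name"], record["stars"])
```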
- Create custom Superset dashboards
- Add more programming languages
- Collect user profiles with `collect_user_profile.py`
- Explore graph relationships in Memgraph Lab
- Build your own API endpoints
MIT License - Feel free to use and modify!
Found a bug? Have an idea? Open an issue or submit a pull request!
Built using Docker, FastAPI, Memgraph, MongoDB, Dragonfly, and Apache Superset