A scalable, full-stack application that analyzes GitHub repositories by extracting commit history and generating contributor leaderboards. It features asynchronous processing, real-time status updates, and support for both public and private repositories.
Deployment: Images are built in this repository and deployed via the infrastructure repository.
The GitHub Repository Scraper enables users to:
- Analyze any GitHub repository (public or private) to identify top contributors
- Track commit statistics and generate comprehensive leaderboards
- Monitor processing status in real-time through an intuitive web interface
- Access historical data with persistent storage and caching
The system uses asynchronous job processing to handle large repositories efficiently: the API responds immediately while processing happens in the background.
- Fastify HTTP Server
  - RESTful API with multiple endpoints for repository management
    - /health - Server status check
    - /leaderboard (GET) - Retrieve contributor leaderboard
    - /leaderboard (POST) - Submit repository for processing
    - /repositories - List all processed repositories
  - Dynamic state management (pending, in_progress, completed, failed)
- Asynchronous Processing
  - Bull queue system with Redis for job management
  - Non-blocking API responses
  - Horizontal scaling support for worker processes
- Git Operations
  - Bare repository cloning (space-efficient)
  - Incremental updates using simple-git
  - URL normalization (SSH/HTTPS support)
- Database Integration
  - PostgreSQL for persistent storage
  - Prisma ORM for type-safe database queries
  - Efficient caching of contributors and repositories
- Security & Authentication
  - GitHub Personal Access Token support for private repositories
  - Secure token handling (not stored, only used per request)
  - Comprehensive error handling for network and permission issues
- GitHub API Integration
  - User profile resolution and enrichment
  - Smart handling of GitHub no-reply emails
  - Rate limit awareness
- Modern UI
  - Next.js 15 with React 19
  - Tailwind CSS for styling
  - Radix UI components for accessibility
  - Responsive design (desktop and mobile)
- Interactive Features
  - Repository submission form with private repo support
  - Real-time status updates (automatic polling)
  - Searchable repository table
  - Detailed contributor leaderboard display
- State Management
  - React Query for server state
  - Context API for local UI state
  - Automatic cache invalidation
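The no-reply email handling listed under GitHub API Integration can be illustrated with a short sketch. GitHub no-reply addresses take the form ID+username@users.noreply.github.com (or the older username@users.noreply.github.com), so a username can often be recovered without an API call. The function name parseNoReplyEmail is hypothetical and is not taken from this repository.

```typescript
// Hypothetical helper: extracts a GitHub username from a no-reply commit email.
// Returns null for any address that is not a GitHub no-reply address.
function parseNoReplyEmail(email: string): string | null {
  // Optional numeric ID followed by "+", then the username.
  const match = /^(?:\d+\+)?([^@+\s]+)@users\.noreply\.github\.com$/i.exec(
    email.trim(),
  );
  return match ? match[1] : null;
}

console.log(parseNoReplyEmail("67644735+aalexmrt@users.noreply.github.com")); // aalexmrt
console.log(parseNoReplyEmail("someone@example.com")); // null
```

Addresses that return null would still need a lookup through the GitHub API or would be stored as-is.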
The application follows a microservices architecture with clear separation of concerns:
Frontend (Next.js) → Backend API (Fastify) → Worker Process
                              ↓
                  PostgreSQL + Redis Queue
- Frontend: User interface built with Next.js
- Backend API: Fastify server handling HTTP requests
- Worker: Background process for repository analysis
- PostgreSQL: Persistent data storage
- Redis: Job queue and caching
For detailed architecture documentation, see ARCHITECTURE.md.
- Docker and Docker Compose
- Git
- GitHub Personal Access Token (optional, for private repositories)
- Clone the Repository
  git clone https://github.com/aalexmrt/github-scraper
  cd github-scraper
- Set Up Environment Variables
  Create a .env file in the backend directory:
  cp backend/.env.example backend/.env
  Edit backend/.env with your configuration:
  # Database connection string
  DATABASE_URL=postgresql://user:password@db:5432/github_scraper
  # Redis connection settings
  REDIS_HOST=redis
  REDIS_PORT=6379
  # GitHub API Personal Access Token (optional but recommended)
  GITHUB_TOKEN=your_github_personal_access_token
  Getting a GitHub Token:
  - Go to GitHub Developer Settings
  - Click "Generate new token" (classic)
  - Select scopes: read:user and repo (for private repositories)
  - Copy the token and add it to GITHUB_TOKEN in your .env file
- Start Services
  Build and start all services:
  docker-compose up --build
  This starts:
  - Backend API server (port 3000)
  - Frontend web application (port 3001)
  - PostgreSQL database
  - Redis server
  - Worker process
- Verify Installation
  Check backend health:
  curl http://localhost:3000/health
  Expected response:
  { "message": "Server is running." }
  Access the frontend at:
  http://localhost:3001
If you prefer to run only the frontend locally while keeping the backend and services (database, Redis, worker) in Docker:
- Set Up Environment Variables
  Create a .env file in the project root (or set environment variables):
  # GitHub OAuth Configuration (required for authentication)
  GITHUB_CLIENT_ID=your_github_client_id
  GITHUB_CLIENT_SECRET=your_github_client_secret
  # Session Configuration
  SESSION_SECRET=your-super-secret-session-key-change-in-production
  # Application URLs
  FRONTEND_URL=http://localhost:3001
  BACKEND_URL=http://localhost:3000
  # GitHub Personal Access Token (optional)
  GITHUB_TOKEN=your_github_personal_access_token
  Getting GitHub OAuth Credentials:
  See OAUTH_SETUP.md for detailed instructions on setting up GitHub OAuth.
- Start Docker Services
  Start PostgreSQL, Redis, backend API, and worker:
  docker-compose -f docker-compose.services.yml up -d
  Or use the helper script:
  ./scripts/dev/start-services.sh
  This starts:
  - PostgreSQL database (port 5432)
  - Redis server (port 6379)
  - Backend API server (port 3000)
  - Worker process (background)
- Set Up Frontend Environment
  Create a .env.local file in the frontend directory:
  NEXT_PUBLIC_API_URL=http://localhost:3000
- Install Frontend Dependencies
  cd frontend
  pnpm install
- Start Frontend Server
  pnpm run dev
  The frontend will start on http://localhost:3001
- Verify Installation
  - Backend: curl http://localhost:3000/health
  - Frontend: Open http://localhost:3001 in your browser
Note: The backend, database, Redis, and worker all run in Docker. Only the frontend runs locally. Code changes to the backend will be reflected automatically due to volume mounting.
- Add a Repository
  - Open http://localhost:3001
  - Enter a GitHub repository URL (e.g., https://github.com/user/repo)
  - For private repositories, check "This is a private repository" and enter your GitHub token
  - Click "Submit"
- Monitor Processing
  - View all repositories in the "Processed Repositories" table
  - Status badges indicate current state:
    - 🔵 On Queue: Waiting for processing
    - 🟡 Processing: Currently being analyzed
    - 🟢 Completed: Successfully processed
    - 🔴 Failed: Processing encountered an error
- View Leaderboard
  - Click the "Leaderboard" button for completed repositories
  - See contributors ranked by commit count
  - View contributor details: username, email, profile URL, and commit count
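Ranking contributors by commit count amounts to a group-and-sort over commit authors. The following is a minimal sketch, assuming the commit history has already been reduced to a list of author emails; LeaderboardEntry and buildLeaderboard are illustrative names, not the project's actual code.

```typescript
interface LeaderboardEntry {
  email: string;
  commitCount: number;
}

// Illustrative only: counts commits per author email (case-insensitive)
// and sorts by commit count, highest first. Ties are ordered by email.
function buildLeaderboard(authorEmails: string[]): LeaderboardEntry[] {
  const counts = new Map<string, number>();
  for (const email of authorEmails) {
    const key = email.toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return Array.from(counts, ([email, commitCount]) => ({ email, commitCount }))
    .sort((a, b) => b.commitCount - a.commitCount || a.email.localeCompare(b.email));
}
```

A real implementation would likely group by resolved GitHub user rather than raw email, since one person can commit under several addresses.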
Endpoint: POST /leaderboard
Query Parameters:
- repoUrl (required): GitHub repository URL
Headers:
- Authorization (optional): Bearer <token> for private repositories
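Because repoUrl is itself a URL passed as a query parameter, a client may want to percent-encode it. The sketch below builds the URL and headers for POST /leaderboard; buildLeaderboardRequest and the default base URL are assumptions made for illustration.

```typescript
interface LeaderboardRequest {
  url: string;
  headers: Record<string, string>;
}

// Illustrative helper: encodes repoUrl and adds the optional Bearer token.
function buildLeaderboardRequest(
  repoUrl: string,
  token?: string,
  baseUrl = "http://localhost:3000", // assumed local backend address
): LeaderboardRequest {
  const query = new URLSearchParams({ repoUrl });
  const headers: Record<string, string> = {};
  if (token) {
    headers["Authorization"] = `Bearer ${token}`;
  }
  return { url: `${baseUrl}/leaderboard?${query.toString()}`, headers };
}

const req = buildLeaderboardRequest("https://github.com/aalexmrt/github-scraper");
console.log(req.url);
// http://localhost:3000/leaderboard?repoUrl=https%3A%2F%2Fgithub.com%2Faalexmrt%2Fgithub-scraper
```

The result can then be passed to fetch, for example fetch(req.url, { method: "POST", headers: req.headers }).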
Example:
curl -X POST "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
Response (202 Accepted):
{ "message": "Repository is being processed." }
Response (200 OK - Already Completed):
{
  "message": "Repository processed successfully.",
  "lastProcessedAt": "2024-11-28T12:00:00Z"
}
Endpoint: GET /leaderboard
Query Parameters:
- repoUrl (required): GitHub repository URL
Example:
curl "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"
Response:
{
  "repository": "https://github.com/aalexmrt/github-scraper",
  "top_contributors": [
    {
      "username": "aalexmrt",
      "email": "67644735+aalexmrt@users.noreply.github.com",
      "profileUrl": "https://github.com/aalexmrt",
      "commitCount": 23
    }
  ]
}
Endpoint: GET /repositories
Example:
curl "http://localhost:3000/repositories"
Response:
[
  {
    "id": 1,
    "url": "https://github.com/aalexmrt/github-scraper",
    "pathName": "github-scraper",
    "state": "completed",
    "lastProcessedAt": "2024-11-28T12:00:00Z",
    "createdAt": "2024-11-28T10:00:00Z",
    "updatedAt": "2024-11-28T12:00:00Z"
  }
]
github-scraper/
├── backend/
│   ├── src/
│   │   ├── index.ts            # Fastify server
│   │   ├── services/           # Business logic
│   │   │   ├── queueService.ts # Bull queue setup
│   │   │   └── repoService.ts  # Repository operations
│   │   ├── workers/            # Background workers
│   │   │   └── repoWorker.ts   # Repository processing worker
│   │   └── utils/              # Utilities
│   │       ├── prisma.ts       # Prisma client
│   │       ├── isValidGitHubUrl.ts
│   │       └── normalizeUrl.ts
│   └── prisma/
│       └── schema.prisma       # Database schema
├── frontend/
│   └── src/
│       ├── app/                # Next.js app directory
│       ├── components/         # React components
│       ├── context/            # React Context providers
│       ├── services/           # API services
│       └── hooks/              # Custom React hooks
└── docker-compose.yml          # Docker orchestration
The Docker setup includes hot-reload for both backend and frontend:
- Backend: Uses nodemon to watch for TypeScript changes
- Frontend: Uses Next.js built-in HMR (Hot Module Replacement)
Changes to code are automatically reflected without restarting containers.
Prisma migrations run automatically on container startup. To create a new migration:
cd backend
npx prisma migrate dev --name migration_name
- Repository Submission: User submits a GitHub repository URL via the web interface or API
- URL Validation: System validates and normalizes the URL (handles SSH/HTTPS formats)
- Job Queue: Repository is added to Redis queue for asynchronous processing
- Repository Sync: Worker clones or updates the repository (bare clone for efficiency)
- Commit Analysis: System analyzes commit history and extracts contributor information
- User Resolution: Contributors are resolved using GitHub API (if needed) and cached
- Leaderboard Generation: Commit counts are calculated and stored in database
- Status Updates: Frontend polls for status updates and displays results when ready
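The URL validation step in the list above can be illustrated with a small normalizer that maps SSH and HTTPS forms to one canonical HTTPS URL. This is a sketch under assumptions: the project's own normalizeUrl.ts may behave differently, and normalizeGitHubUrl is a hypothetical name.

```typescript
// Hypothetical normalizer: accepts git@github.com:owner/repo(.git),
// ssh://git@github.com/owner/repo(.git) and https://github.com/owner/repo(.git),
// and returns https://github.com/owner/repo, or null if nothing matches.
function normalizeGitHubUrl(input: string): string | null {
  const patterns = [
    /^git@github\.com:([\w.-]+)\/([\w.-]+?)(?:\.git)?\/?$/i,
    /^ssh:\/\/git@github\.com\/([\w.-]+)\/([\w.-]+?)(?:\.git)?\/?$/i,
    /^https?:\/\/github\.com\/([\w.-]+)\/([\w.-]+?)(?:\.git)?\/?$/i,
  ];
  for (const pattern of patterns) {
    const match = pattern.exec(input.trim());
    if (match) {
      return `https://github.com/${match[1]}/${match[2]}`;
    }
  }
  return null;
}
```

Normalizing before queueing means the same repository submitted in two different forms maps to a single database record and a single clone.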
- Runtime: Node.js with TypeScript
- Framework: Fastify 5.1.0
- Database: PostgreSQL 15 with Prisma ORM
- Queue: Bull 4.16.4 with Redis
- Git: simple-git 3.27.0
- HTTP Client: Axios 1.7.7
- Framework: Next.js 15.0.3
- UI Library: React 19
- Styling: Tailwind CSS 3.4.1
- Components: Radix UI
- State: React Query 5.61.0, Context API
- Forms: React Hook Form 7.53.2
- Containerization: Docker & Docker Compose
- Database: PostgreSQL 15
- Cache/Queue: Redis 6
Issue: Backend won't start
- Solution: Check that PostgreSQL and Redis containers are running
- Verify DATABASE_URL in .env matches Docker Compose configuration
Issue: Repository processing fails
- Solution: Check repository URL is valid and accessible
- For private repos, ensure GitHub token has correct permissions
- Check worker container logs: docker-compose logs worker
Issue: Frontend can't connect to backend
- Solution: Verify Next.js rewrite configuration in next.config.ts
- Ensure backend container is named app in Docker Compose
Issue: Rate limit errors from GitHub API
- Solution: Add a GitHub Personal Access Token to .env
- A token increases the rate limit from 60 to 5,000 requests/hour
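GitHub reports the remaining quota in the x-ratelimit-remaining and x-ratelimit-reset response headers, the latter as a UTC epoch timestamp in seconds. A worker could use them to decide how long to pause before the next call. The sketch below shows one possible approach; msUntilRateLimitReset is a hypothetical helper, not code from this project.

```typescript
// Hypothetical helper: returns how many milliseconds to wait before the next
// GitHub API call, or 0 if quota remains or the headers are missing.
function msUntilRateLimitReset(
  headers: Record<string, string | undefined>,
  nowMs: number = Date.now(),
): number {
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const resetSeconds = Number(headers["x-ratelimit-reset"]);
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) {
    return 0; // headers absent or malformed: do not block
  }
  if (remaining > 0) {
    return 0;
  }
  // Quota exhausted: wait until the reset time, never a negative duration.
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

The header names are expected in lower case here, which matches how most Node.js HTTP clients expose them.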
- Implement exponential backoff for GitHub API rate limits
- Add automatic retry mechanism for failed repositories
- Horizontal scaling with multiple workers
- Redis caching for leaderboard results
- Structured logging and monitoring
- WebSocket integration for real-time updates (replace polling)
- UI/UX improvements
- Export leaderboard data (CSV/JSON)
- Advanced filtering and search
- Pagination for large datasets
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.
This repository handles image building only. Deployment is handled by the infrastructure repository.
Images are automatically built and pushed to GCP Artifact Registry when you create a git tag:
# Create a service-specific tag
git tag api-v1.2.3
git tag commit-worker-v1.5.0
git tag user-worker-v2.0.1
# Push tags to trigger builds
git push origin --tags
- Format: <service>-v<version>
- Services: api, commit-worker, user-worker
- Version: Semantic versioning (e.g., 1.2.3)
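The tag convention can be checked mechanically before pushing. The following sketch parses a tag into its service and version parts; parseReleaseTag is an illustrative name, and the actual GitHub Actions workflow may match tags differently.

```typescript
type Service = "api" | "commit-worker" | "user-worker";

interface ReleaseTag {
  service: Service;
  version: string; // e.g. "1.2.3"
}

// Illustrative parser for tags of the form <service>-v<major>.<minor>.<patch>.
function parseReleaseTag(tag: string): ReleaseTag | null {
  const match = /^(api|commit-worker|user-worker)-v(\d+\.\d+\.\d+)$/.exec(tag);
  if (!match) {
    return null;
  }
  return { service: match[1] as Service, version: match[2] };
}

console.log(parseReleaseTag("commit-worker-v1.5.0")?.version); // 1.5.0
console.log(parseReleaseTag("frontend-v1.0.0")); // null
```

This sketch accepts only plain major.minor.patch versions; pre-release suffixes such as -rc.1 would need a wider pattern.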
- Tag created → GitHub Actions workflow triggers
- Image built → Docker image built from Dockerfile
- Image pushed → Pushed to Artifact Registry
- Deployment triggered → Automatically triggers deployment in infra repo
Dockerfiles are located in:
- backend/Dockerfile.prod - API service
- backend/Dockerfile.cloudrun-commit-worker - Commit worker
- backend/Dockerfile.cloudrun-user-worker - User worker
- ARCHITECTURE.md - Detailed architecture and design patterns
- API Documentation - Complete API reference
- Infrastructure Repository - Deployment configurations and scripts
Alex Martinez
- GitHub: @aalexmrt
Note: This application processes repositories asynchronously. Large repositories may take several minutes to process. The frontend automatically polls for status updates and will display results when processing completes.
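The polling behaviour described in the note above can be sketched as a loop that re-checks a repository's state until it reaches a terminal value. This is an illustration under assumptions: pollUntilDone, the getState callback, and the interval and attempt limits are all hypothetical, and the real frontend may implement polling differently (for example through React Query).

```typescript
type RepoState = "pending" | "in_progress" | "completed" | "failed";

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical poller: calls getState until it returns a terminal state
// or maxAttempts is reached, waiting intervalMs between attempts.
async function pollUntilDone(
  getState: () => Promise<RepoState>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<RepoState> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const state = await getState();
    if (state === "completed" || state === "failed") {
      return state;
    }
    if (attempt < maxAttempts) {
      await sleep(intervalMs);
    }
  }
  throw new Error(`Repository still processing after ${maxAttempts} attempts`);
}
```

In practice getState would wrap a call such as GET /repositories and pick out the state field of the repository of interest.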