GitHub Repository Scraper

A scalable, full-stack application that analyzes GitHub repositories by extracting commit history and generating contributor leaderboards. It features asynchronous background processing, real-time status updates, and support for both public and private repositories.

Deployment: Docker images are built in this repository and deployed via the infrastructure repository.

🎯 Overview

The GitHub Repository Scraper enables users to:

  • Analyze any GitHub repository (public or private) to identify top contributors
  • Track commit statistics and generate comprehensive leaderboards
  • Monitor processing status in real-time through an intuitive web interface
  • Access historical data with persistent storage and caching

The system uses asynchronous job processing to handle large repositories efficiently, ensuring responsive API responses while processing happens in the background.

✨ Features

Backend

  • 🚀 Fastify HTTP Server

    • RESTful API with multiple endpoints for repository management
    • /health - Server status check
    • /leaderboard (GET) - Retrieve contributor leaderboard
    • /leaderboard (POST) - Submit repository for processing
    • /repositories - List all processed repositories
    • Dynamic state management (pending, in_progress, completed, failed)
  • ⚑ Asynchronous Processing

    • Bull queue system with Redis for job management
    • Non-blocking API responses
    • Horizontal scaling support for worker processes
  • 🔧 Git Operations

    • Bare repository cloning (space-efficient)
    • Incremental updates using simple-git
    • URL normalization (SSH/HTTPS support)
  • 💾 Database Integration

    • PostgreSQL for persistent storage
    • Prisma ORM for type-safe database queries
    • Efficient caching of contributors and repositories
  • 🔐 Security & Authentication

    • GitHub Personal Access Token support for private repositories
    • Secure token handling (not stored, only used per request)
    • Comprehensive error handling for network and permission issues
  • 🌐 GitHub API Integration

    • User profile resolution and enrichment
    • Smart handling of GitHub no-reply emails
    • Rate limit awareness
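
As a rough illustration of how these backend pieces fit together, here is a minimal TypeScript sketch of the queue wiring: the API enqueues a job and returns immediately, while a worker makes a bare clone with simple-git and records the result through Prisma. Names such as repo-processing, enqueueRepository, and the Prisma model/field details are assumptions for illustration, not the project's actual identifiers.

// queueSketch.ts — hypothetical Bull/Redis wiring; names and paths are assumptions.
import Queue from 'bull'
import { simpleGit } from 'simple-git'
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// One queue shared by the API (producer) and the worker (consumer).
export const repoQueue = new Queue('repo-processing', {
  redis: {
    host: process.env.REDIS_HOST ?? 'redis',
    port: Number(process.env.REDIS_PORT ?? 6379),
  },
})

// API side: enqueue and return immediately, keeping the HTTP response non-blocking.
export async function enqueueRepository(repoUrl: string): Promise<void> {
  await repoQueue.add({ repoUrl })
}

// Worker side: each worker process registers a processor; running additional
// worker containers scales job consumption horizontally.
repoQueue.process(async (job) => {
  const { repoUrl } = job.data as { repoUrl: string }
  const localPath = `/tmp/repos/${encodeURIComponent(repoUrl)}.git`

  // A bare clone keeps only the Git object database, which is enough to read history.
  await simpleGit().clone(repoUrl, localPath, ['--bare'])

  // ...analyze commit history and upsert contributors here...

  // Assumes a Repository model whose `url` field is unique in the Prisma schema.
  await prisma.repository.update({
    where: { url: repoUrl },
    data: { state: 'completed', lastProcessedAt: new Date() },
  })
})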

Frontend

  • 🎨 Modern UI

    • Next.js 15 with React 19
    • Tailwind CSS for styling
    • Radix UI components for accessibility
    • Responsive design (desktop and mobile)
  • 📊 Interactive Features

    • Repository submission form with private repo support
    • Real-time status updates via automatic polling (see the sketch after this feature list)
    • Searchable repository table
    • Detailed contributor leaderboard display
  • ⚛️ State Management

    • React Query for server state
    • Context API for local UI state
    • Automatic cache invalidation
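
A minimal sketch of how the automatic polling could be wired with React Query v5; the hook name, endpoint URL, and 5-second interval are assumptions rather than the app's actual code:

// useRepositories.ts — hypothetical polling hook built on React Query v5.
import { useQuery } from '@tanstack/react-query'

interface Repository {
  id: number
  url: string
  state: 'pending' | 'in_progress' | 'completed' | 'failed'
}

export function useRepositories() {
  return useQuery({
    queryKey: ['repositories'],
    queryFn: async (): Promise<Repository[]> => {
      const res = await fetch('http://localhost:3000/repositories')
      if (!res.ok) throw new Error(`Request failed with status ${res.status}`)
      return res.json()
    },
    // Poll every 5 seconds while any repository is still pending or in progress.
    refetchInterval: (query) =>
      query.state.data?.some((r) => r.state === 'pending' || r.state === 'in_progress')
        ? 5000
        : false,
  })
}

Returning false from refetchInterval stops the polling once nothing is pending, so the table stays fresh without hammering the API.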

🏗️ Architecture

The application follows a microservices architecture with clear separation of concerns:

Frontend (Next.js) → Backend API (Fastify) → Worker Process
                              ↓
                    PostgreSQL + Redis Queue

  • Frontend: User interface built with Next.js
  • Backend API: Fastify server handling HTTP requests
  • Worker: Background process for repository analysis
  • PostgreSQL: Persistent data storage
  • Redis: Job queue and caching

For detailed architecture documentation, see ARCHITECTURE.md.

🚀 Getting Started

Prerequisites

  • Docker and Docker Compose
  • Node.js and pnpm (only needed if you run the frontend locally)
  • A GitHub Personal Access Token (optional, for private repositories and higher API rate limits)

Installation

  1. Clone the Repository

    git clone https://github.com/aalexmrt/github-scraper
    cd github-scraper
  2. Set Up Environment Variables

    Create a .env file in the backend directory:

    cp backend/.env.example backend/.env

    Edit backend/.env with your configuration:

    # Database connection string
    DATABASE_URL=postgresql://user:password@db:5432/github_scraper
    
    # Redis connection settings
    REDIS_HOST=redis
    REDIS_PORT=6379
    
    # GitHub API Personal Access Token (optional but recommended)
    GITHUB_TOKEN=your_github_personal_access_token

    Getting a GitHub Token:

    1. Go to GitHub Developer Settings
    2. Click "Generate new token" (classic)
    3. Select scopes: read:user and repo (for private repositories)
    4. Copy the token and add it to GITHUB_TOKEN in your .env file
  3. Start Services

    Build and start all services:

    docker-compose up --build

    This starts:

    • Backend API server (port 3000)
    • Frontend web application (port 3001)
    • PostgreSQL database
    • Redis server
    • Worker process
  4. Verify Installation

    Check backend health:

    curl http://localhost:3000/health

    Expected response:

    { "message": "Server is running." }

    Access the frontend at: http://localhost:3001

Local Development (Frontend Locally, Backend in Docker)

If you prefer to run only the frontend locally while keeping the backend and services (database, Redis, worker) in Docker:

  1. Set Up Environment Variables

    Create a .env file in the project root (or set environment variables):

    # GitHub OAuth Configuration (required for authentication)
    GITHUB_CLIENT_ID=your_github_client_id
    GITHUB_CLIENT_SECRET=your_github_client_secret
    
    # Session Configuration
    SESSION_SECRET=your-super-secret-session-key-change-in-production
    
    # Application URLs
    FRONTEND_URL=http://localhost:3001
    BACKEND_URL=http://localhost:3000
    
    # GitHub Personal Access Token (optional)
    GITHUB_TOKEN=your_github_personal_access_token

    Getting GitHub OAuth Credentials:

    See OAUTH_SETUP.md for detailed instructions on setting up GitHub OAuth.

  2. Start Docker Services

    Start PostgreSQL, Redis, backend API, and worker:

    docker-compose -f docker-compose.services.yml up -d

    Or use the helper script:

    ./scripts/dev/start-services.sh

    This starts:

    • PostgreSQL database (port 5432)
    • Redis server (port 6379)
    • Backend API server (port 3000)
    • Worker process (background)
  3. Set Up Frontend Environment

    Create a .env.local file in the frontend directory:

    NEXT_PUBLIC_API_URL=http://localhost:3000
  4. Install Frontend Dependencies

    cd frontend
    pnpm install
  5. Start Frontend Server

    pnpm run dev

    The frontend will start on http://localhost:3001

  6. Verify Installation

    • Backend: curl http://localhost:3000/health
    • Frontend: Open http://localhost:3001 in your browser

Note: The backend, database, Redis, and worker all run in Docker. Only the frontend runs locally. Code changes to the backend will be reflected automatically due to volume mounting.

📖 Usage

Using the Web Interface

  1. Add a Repository

    • Open http://localhost:3001
    • Enter a GitHub repository URL (e.g., https://github.com/user/repo)
    • For private repositories, check "This is a private repository" and enter your GitHub token
    • Click "Submit"
  2. Monitor Processing

    • View all repositories in the "Processed Repositories" table
    • Status badges indicate current state:
      • 🔵 On Queue: Waiting for processing
      • 🟡 Processing: Currently being analyzed
      • 🟢 Completed: Successfully processed
      • 🔴 Failed: Processing encountered an error
  3. View Leaderboard

    • Click the "Leaderboard" button for completed repositories
    • See contributors ranked by commit count
    • View contributor details: username, email, profile URL, and commit count

Using the API

Submit a Repository for Processing

Endpoint: POST /leaderboard

Query Parameters:

  • repoUrl (required): GitHub repository URL

Headers:

  • Authorization (optional): Bearer <token> for private repositories

Example:

curl -X POST "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"

Response (202 Accepted):

{ "message": "Repository is being processed." }

Response (200 OK - Already Completed):

{
  "message": "Repository processed successfully.",
  "lastProcessedAt": "2024-11-28T12:00:00Z"
}
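
The same request can be made from TypeScript, sending the token only for private repositories. A hedged sketch (the repository URL, token source, and status handling are illustrative):

// submitPrivateRepo.ts — illustrative client call; URL and token source are placeholders.
async function submitPrivateRepo(repoUrl: string, token: string): Promise<void> {
  const res = await fetch(
    `http://localhost:3000/leaderboard?repoUrl=${encodeURIComponent(repoUrl)}`,
    {
      method: 'POST',
      // The token is sent per request and, per the backend design, never stored.
      headers: { Authorization: `Bearer ${token}` },
    }
  )

  if (res.status === 202) {
    console.log('Repository queued for processing')
  } else if (res.ok) {
    console.log('Repository already processed:', await res.json())
  } else {
    console.error('Submission failed with status', res.status)
  }
}

submitPrivateRepo('https://github.com/your-org/private-repo', process.env.GITHUB_TOKEN ?? '')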

Retrieve Leaderboard

Endpoint: GET /leaderboard

Query Parameters:

  • repoUrl (required): GitHub repository URL

Example:

curl "http://localhost:3000/leaderboard?repoUrl=https://github.com/aalexmrt/github-scraper"

Response:

{
  "repository": "https://github.com/aalexmrt/github-scraper",
  "top_contributors": [
    {
      "username": "aalexmrt",
      "email": "67644735+aalexmrt@users.noreply.github.com",
      "profileUrl": "https://github.com/aalexmrt",
      "commitCount": 23
    }
  ]
}

List All Repositories

Endpoint: GET /repositories

Example:

curl "http://localhost:3000/repositories"

Response:

[
  {
    "id": 1,
    "url": "https://github.com/aalexmrt/github-scraper",
    "pathName": "github-scraper",
    "state": "completed",
    "lastProcessedAt": "2024-11-28T12:00:00Z",
    "createdAt": "2024-11-28T10:00:00Z",
    "updatedAt": "2024-11-28T12:00:00Z"
  }
]
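
Putting the three endpoints together, a client can submit a repository, poll /repositories until it reaches the completed state, and then fetch the leaderboard. A hedged TypeScript sketch (function name, polling interval, and error handling are assumptions):

// submitAndPoll.ts — hypothetical end-to-end client flow; names and timings are illustrative.
const API = 'http://localhost:3000'

interface Contributor {
  username: string
  email: string
  profileUrl: string
  commitCount: number
}

async function fetchLeaderboard(repoUrl: string): Promise<Contributor[]> {
  // 1. Submit the repository (202 while queued, 200 if already processed).
  await fetch(`${API}/leaderboard?repoUrl=${encodeURIComponent(repoUrl)}`, { method: 'POST' })

  // 2. Poll the repository list until this repository reaches the "completed" state.
  for (;;) {
    const repos: Array<{ url: string; state: string }> = await (
      await fetch(`${API}/repositories`)
    ).json()
    const repo = repos.find((r) => r.url === repoUrl)
    if (repo?.state === 'completed') break
    if (repo?.state === 'failed') throw new Error('Repository processing failed')
    await new Promise((resolve) => setTimeout(resolve, 5000))
  }

  // 3. Fetch the finished leaderboard.
  const res = await fetch(`${API}/leaderboard?repoUrl=${encodeURIComponent(repoUrl)}`)
  const body: { top_contributors: Contributor[] } = await res.json()
  return body.top_contributors
}

fetchLeaderboard('https://github.com/aalexmrt/github-scraper').then((top) => console.log(top))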

🛠️ Development

Project Structure

github-scraper/
├── backend/
│   ├── src/
│   │   ├── index.ts              # Fastify server
│   │   ├── services/             # Business logic
│   │   │   ├── queueService.ts   # Bull queue setup
│   │   │   └── repoService.ts    # Repository operations
│   │   ├── workers/              # Background workers
│   │   │   └── repoWorker.ts     # Repository processing worker
│   │   └── utils/                # Utilities
│   │       ├── prisma.ts         # Prisma client
│   │       ├── isValidGitHubUrl.ts
│   │       └── normalizeUrl.ts
│   └── prisma/
│       └── schema.prisma         # Database schema
├── frontend/
│   └── src/
│       ├── app/                  # Next.js app directory
│       ├── components/           # React components
│       ├── context/              # React Context providers
│       ├── services/             # API services
│       └── hooks/                # Custom React hooks
└── docker-compose.yml            # Docker orchestration

Running in Development Mode

The Docker setup includes hot-reload for both backend and frontend:

  • Backend: Uses nodemon to watch for TypeScript changes
  • Frontend: Uses Next.js built-in HMR (Hot Module Replacement)

Changes to code are automatically reflected without restarting containers.

Database Migrations

Prisma migrations run automatically on container startup. To create a new migration:

cd backend
npx prisma migrate dev --name migration_name

🔍 How It Works

  1. Repository Submission: User submits a GitHub repository URL via web interface or API
  2. URL Validation: System validates and normalizes the URL, handling both SSH and HTTPS formats (see the sketch after these steps)
  3. Job Queue: Repository is added to Redis queue for asynchronous processing
  4. Repository Sync: Worker clones or updates the repository (bare clone for efficiency)
  5. Commit Analysis: System analyzes commit history and extracts contributor information
  6. User Resolution: Contributors are resolved using GitHub API (if needed) and cached
  7. Leaderboard Generation: Commit counts are calculated and stored in database
  8. Status Updates: Frontend polls for status updates and displays results when ready
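
Steps 2 and 6 are the least obvious, so here is a hypothetical TypeScript sketch of URL normalization and no-reply email handling. It illustrates the idea only and is not the contents of backend/src/utils/normalizeUrl.ts:

// normalizationSketch.ts — illustrative only; the real logic lives in backend/src/utils/.

// Normalize SSH and HTTPS remotes to one canonical HTTPS form.
export function normalizeUrl(repoUrl: string): string {
  // git@github.com:user/repo.git -> https://github.com/user/repo
  const ssh = repoUrl.match(/^git@github\.com:(.+?)(\.git)?$/)
  if (ssh) return `https://github.com/${ssh[1]}`
  // https://github.com/user/repo.git -> https://github.com/user/repo
  return repoUrl.replace(/\.git$/, '').replace(/\/+$/, '')
}

// GitHub no-reply addresses ("12345+login@users.noreply.github.com") already
// contain the username, so they can be resolved without an API call.
export function usernameFromNoReply(email: string): string | null {
  const match = email.match(/^(?:\d+\+)?([^@]+)@users\.noreply\.github\.com$/)
  return match ? match[1] : null
}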

📊 Tech Stack

Backend

  • Runtime: Node.js with TypeScript
  • Framework: Fastify 5.1.0
  • Database: PostgreSQL 15 with Prisma ORM
  • Queue: Bull 4.16.4 with Redis
  • Git: simple-git 3.27.0
  • HTTP Client: Axios 1.7.7

Frontend

  • Framework: Next.js 15.0.3
  • UI Library: React 19
  • Styling: Tailwind CSS 3.4.1
  • Components: Radix UI
  • State: React Query 5.61.0, Context API
  • Forms: React Hook Form 7.53.2

Infrastructure

  • Containerization: Docker & Docker Compose
  • Database: PostgreSQL 15
  • Cache/Queue: Redis 6

🐛 Troubleshooting

Common Issues

Issue: Backend won't start

  • Solution: Check that PostgreSQL and Redis containers are running
  • Verify DATABASE_URL in .env matches Docker Compose configuration

Issue: Repository processing fails

  • Solution: Check repository URL is valid and accessible
  • For private repos, ensure GitHub token has correct permissions
  • Check worker container logs: docker-compose logs worker

Issue: Frontend can't connect to backend

  • Solution: Verify Next.js rewrite configuration in next.config.ts
  • Ensure backend container is named app in Docker Compose

Issue: Rate limit errors from GitHub API

  • Solution: Add a GitHub Personal Access Token to .env
  • Token increases rate limit from 60 to 5000 requests/hour

🚧 Roadmap

Backend

  • Implement exponential backoff for GitHub API rate limits
  • Add automatic retry mechanism for failed repositories
  • Horizontal scaling with multiple workers
  • Redis caching for leaderboard results
  • Structured logging and monitoring

Frontend

  • WebSocket integration for real-time updates (replace polling)
  • Enhanced UI/UX improvements
  • Export leaderboard data (CSV/JSON)
  • Advanced filtering and search
  • Pagination for large datasets

📝 License

This project is open source and available under the MIT License.

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

🚀 Image Building & Deployment

This repository handles image building only. Deployment is handled by the infrastructure repository.

Building Images

Images are automatically built and pushed to GCP Artifact Registry when you create a git tag:

# Create a service-specific tag
git tag api-v1.2.3
git tag commit-worker-v1.5.0
git tag user-worker-v2.0.1

# Push tags to trigger builds
git push origin --tags

Tag Format

  • Format: <service>-v<version>
  • Services: api, commit-worker, user-worker
  • Version: Semantic versioning (e.g., 1.2.3)

Workflow

  1. Tag created → GitHub Actions workflow triggers
  2. Image built → Docker image built from Dockerfile
  3. Image pushed → Pushed to Artifact Registry
  4. Deployment triggered → Automatically triggers deployment in infra repo

Dockerfiles

Dockerfiles are located in:

  • backend/Dockerfile.prod - API service
  • backend/Dockerfile.cloudrun-commit-worker - Commit worker
  • backend/Dockerfile.cloudrun-user-worker - User worker

📚 Additional Documentation

  • ARCHITECTURE.md - Detailed architecture documentation
  • OAUTH_SETUP.md - GitHub OAuth setup guide

👀 Author

Alex Martinez


Note: This application processes repositories asynchronously. Large repositories may take several minutes to process. The frontend automatically polls for status updates and will display results when processing completes.
