ArshTiwari2004/go-text-search-engine

GoSearch is a fast full-text search engine built from scratch in Go. It indexes documents using an inverted index and ranks search results with TF-IDF, returning relevant results in milliseconds. Designed for speed, it supports concurrent indexing, persistent storage for quick startup, and a RESTful API, making it an efficient, cost-free alternative to heavier search systems like Elasticsearch for datasets under 10 million documents.

For testing, I used the simplewiki-latest-pages-articles.xml.bz2 dump file from https://dumps.wikimedia.org/simplewiki/latest/. You can download any Wikipedia dump from there and use it for testing. If you're using a different dump file, update the default path in main.go:

flag.StringVar(&dumpPath, "dump", "simplewiki-latest-pages-articles.xml.bz2",
    "Path to Wikipedia dump file")

or pass your own dump file path using the -dump flag when running the program:

go run cmd/main.go -dump path/to/your-dump-file.xml.bz2

Installation

Prerequisites

  • Go 1.21+ (Download)
  • Git
  • 4GB+ RAM (indexing large dumps may hang with less)

Quick Start

# Clone the repository
git clone https://github.com/ArshTiwari2004/go-text-search-engine.git
cd gosearch

# Download dependencies
go mod download

# Download Wikipedia dump (optional as you can use your own data)
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

# Build the project
go build -o gosearch ./cmd/api

# Run the server
./gosearch -dump simplewiki-latest-pages-articles.xml.bz2 -port 8080

Set up the backend and run the Go server

git clone https://github.com/ArshTiwari2004/go-text-search-engine.git
cd gosearch
cd cmd/api

# Place dataset here:
# simplewiki-latest-pages-articles.xml.bz2

go run main.go

The API will start at http://localhost:8080, and you will see the message "Starting GoSearch API Server".

If you want the full system, with Redis and Postgres included, run this instead:

go run main.go \
  -redis localhost:6379 \
  -dbdsn "postgres://gosearch:gosearch@localhost:5432/gosearch?sslmode=disable"

Set up the frontend (optional)

cd frontend

# Install dependencies
npm install

# Start development server
npm start

# Frontend will open at http://localhost:5173

Usage

Command Line

# First run, builds index
./gosearch -dump wiki-dump.xml.gz

# Subsequent runs, it loads from disk
./gosearch

# Force rebuild
./gosearch -rebuild

Programmatic Usage

package main

import (
    "fmt"

    "github.com/ArshTiwari2004/gosearch/internal/engine"
)

func main() {
    // Create engine
    eng := engine.NewEngine()
    
    // Load documents
    docs, _ := engine.LoadDocuments("dump.xml.gz")
    
    // Build index
    eng.IndexDocuments(docs)
    
    // Search
    results, _ := eng.Search("golang concurrency", 10)
    
    for _, result := range results {
        fmt.Printf("%s (score: %.3f)\n", result.Document.Title, result.Score)
    }
}

Modern applications require search functionality, but existing solutions have limitations:

Solution              Problem
--------------------  -----------------------------------------------------------
Elasticsearch         Expensive ($$$), complex setup, overkill for <10M docs
Algolia               Vendor lock-in, expensive at scale ($2K+/month)
Built-in SQL LIKE     Doesn't scale beyond 100K records, no relevance ranking
strings.Contains()    O(n) per search, no ranking, impractical for large datasets

Configuration Options

These are the configuration options provided in main.go; more will be added over time.

Command-line flags:

./gosearch [options]

Options:
  -dump string
        Path to Wikipedia XML dump file
        (default "enwiki-latest-stub-articles.xml.gz")
  
  -data string
        Directory for index persistence
        (default "./data")
  
  -port string
        HTTP server port
        (default "8080")
  
  -rebuild
        Force rebuild index from dump (ignores persisted index)
        (default false)

Examples

# Use custom dump file
./gosearch -dump my-documents.xml.gz

# Use different port
./gosearch -port 3000

# Force rebuild (useful after code changes)
./gosearch -rebuild

# All options combined
./gosearch -dump data.xml.gz -port 9000 -data /var/lib/gosearch -rebuild

Features available in GoSearch:

Core Search Engine

  • Inverted Index - Maps terms to documents for fast lookups
  • TF-IDF Ranking - Relevance scoring based on term frequency and inverse document frequency
  • Text Analysis Pipeline
    • Tokenization (split on word boundaries)
    • Lowercasing (case-insensitive search)
    • Stopword removal (filter common words)
    • Snowball stemming (reduce to root forms)
  • Boolean AND Queries - Find documents containing all query terms
  • Ranked Results - Sort by relevance score
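The text-analysis pipeline above can be sketched as a single function. This is a simplified stand-in: the stopword list here is tiny and the suffix stripping is only illustrative, since the real engine uses a Snowball stemmer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyze runs a simplified version of the pipeline:
// tokenize on word boundaries, lowercase, drop stopwords, stem.
func analyze(text string) []string {
	stop := map[string]bool{"the": true, "a": true, "is": true, "are": true, "of": true}
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	var out []string
	for _, tok := range tokens {
		if stop[tok] {
			continue // stopword removal
		}
		// Naive suffix stripping; a real Snowball stemmer is far more careful.
		tok = strings.TrimSuffix(tok, "ing")
		tok = strings.TrimSuffix(tok, "s")
		out = append(out, tok)
	}
	return out
}

func main() {
	fmt.Println(analyze("The cats are running")) // [cat runn]
}
```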

Performance Optimizations

  • Concurrent Indexing - Worker pool pattern for parallel processing
  • Persistent Storage - Save/load index to avoid rebuild (85% startup time reduction)
  • Memory Efficiency - Optimized data structures
  • Posting List Intersection - Efficient merge algorithm (O(n+m))
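The posting-list intersection above is a classic two-pointer merge over sorted doc-ID lists; a minimal sketch:

```go
package main

import "fmt"

// intersect returns doc IDs present in both sorted posting lists,
// advancing two cursors in a single pass: O(n+m).
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++ // a is behind; advance it
		default:
			j++ // b is behind; advance it
		}
	}
	return out
}

func main() {
	fmt.Println(intersect([]int{1, 3, 5, 7}, []int{3, 4, 5, 8})) // [3 5]
}
```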

API & Integration

  • RESTful API - JSON endpoints with Gin framework
  • CORS Support - Enable frontend integration
  • Statistics Endpoint - Real-time performance metrics
  • Health Checks - Monitoring and alerting support
  • Documentation - OpenAPI/Swagger compatible

Developer Experience

  • Clean Architecture - Separation of concerns
  • Comprehensive Comments - Documented each function cleanly
  • Error Handling - Proper error propagation
  • Type Safety - Strongly typed throughout

Multi-Language Support

GoSearch is built in Go, but it is not limited to Go applications. Because GoSearch exposes a RESTful HTTP API with JSON responses, it can be used from any programming language that supports HTTP requests.

This makes GoSearch fully language agnostic, similar to how Elasticsearch works internally.

GoSearch runs as an HTTP server:

GoSearch Engine (Go)
↓
Gin HTTP Server → REST API (JSON)

Any language capable of sending HTTP requests can integrate with it.

Python example

import requests

response = requests.post(
    "http://localhost:8080/api/v1/search",
    json={
        "query": "golang concurrency",
        "max_results": 5
    }
)

print(response.json())

NodeJS example

const axios = require("axios");

async function search() {
  const response = await axios.post(
    "http://localhost:8080/api/v1/search",
    {
      query: "golang concurrency",
      max_results: 5
    }
  );

  console.log(response.data);
}

search();

Java example

// Using Java 11+ HttpClient
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8080/api/v1/search"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(
        "{\"query\":\"golang\",\"max_results\":5}"
    ))
    .build();

In the future, official client SDKs may be provided for: gosearch-go, gosearch-py, gosearch-js.

These SDKs would simply wrap the REST API to provide a cleaner developer experience.

High-Level System Architecture

graph TB
    subgraph "Client Layer"
        A[Web Browser]
        B[Mobile App]
        C[CLI Tool]
    end
    
    subgraph "Frontend Layer"
        D[React UI Search Interface]
    end
    
    subgraph "API Gateway"
        E[Nginx]
    end
    
    subgraph "Application Layer"
        F[REST API Server - Gin Framework]
        G[Search Engine Core]
        H[Text Analyzer]
        I[Ranking Engine TF-IDF]
    end
    
    subgraph "Data Layer"
        J[Inverted index in memory]
        K[Document store in memory]
        L[Persistence layer - Gob files]
    end
    
    subgraph "External Data"
        M[Wikipedia Dump - XML.GZ]
    end
    
    A & B & C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    G --> I
    G --> J
    G --> K
    J -.persist.-> L
    K -.persist.-> L
    M -.load.-> G
    
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style F fill:#e8f5e9
    style G fill:#fce4ec
    style J fill:#f3e5f5
    style L fill:#e0f2f1

While testing with the simplewiki-latest-pages-articles.xml.bz2 dump, the search engine had 1,000 documents indexed (a limit set intentionally) with 52,366 unique terms. When searching for "go programming", the frontend sends a POST request to /api/v1/search and the Go backend performs TF-IDF ranking to return the top results. The query returned 20 results in ~1.78 ms, demonstrating very fast, low-latency performance.

API Documentation

Base URL

http://localhost:8080/api/v1

Endpoints

1. Search (POST)

Endpoint: POST /api/v1/search

Request:

{
  "query": "golang concurrency patterns",
  "max_results": 10,
  "min_score": 0.5
}

Response:

{
  "query": "golang concurrency patterns",
  "results": [
    {
      "document": {
        "id": 12345,
        "title": "Go Concurrency Patterns",
        "url": "https://...",
        "text": "...",
        "word_count": 500
      },
      "score": 8.45,
      "snippets": ["...concurrency patterns in Go..."],
      "rank": 1
    }
  ],
  "total_results": 15,
  "time_taken": "23.5ms",
  "success": true
}

2. Search (GET)

Endpoint: GET /api/v1/search?q=golang&limit=10

Response: Same as POST

3. Get Document

Endpoint: GET /api/v1/document/:id

Response:

{
  "document": {
    "id": 12345,
    "title": "Document Title",
    "text": "Full document text...",
    "url": "https://..."
  },
  "success": true
}

4. Statistics

Endpoint: GET /api/v1/stats

Response:

{
  "total_documents": 600000,
  "total_terms": 2500000,
  "total_queries": 15234,
  "average_query_time": "45.2ms",
  "memory_usage_mb": 450.3,
  "index_size_kb": 102400,
  "uptime": "5h23m"
}

5. Health Check

Endpoint: GET /health

Response:

{
  "status": "healthy",
  "documents": 600000,
  "terms": 2500000,
  "queries": 15234,
  "timestamp": 1640000000
}

Concurrent Indexing Flow

Concurrent indexing uses a worker pool pattern where N workers (based on CPU cores) process documents in parallel, build local indices without locks, then merge at the end for a 1.9x speedup.

A worker pool is a concurrency pattern where:

  • we create N worker goroutines
  • send them tasks through a channel
  • each worker picks up tasks and processes them
  • we wait until all workers finish

Instead of doing work one item at a time, we divide it across multiple workers running in parallel. In this code, each "task" is a document to index.

// Create worker pool
workers := runtime.NumCPU()
docsChan := make(chan Document, workers)

var wg sync.WaitGroup

// Start workers
for i := 0; i < workers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for doc := range docsChan {
            index.AddDocument(doc)  // Process in parallel
        }
    }()
}

// Send work
for _, doc := range documents {
    docsChan <- doc
}
close(docsChan) // let workers' range loops finish
wg.Wait()       // block until every document is indexed

Concurrency Model

I have used two types of concurrency:

1. Worker Pool for parallel indexing

Indexing uses a worker pool pattern to process documents concurrently.

for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for doc := range docsChan {
            e.index.AddDocument(doc)
        }
    }()
}

It works in this way:

  • Start numWorkers goroutines
  • Each worker waits for documents from docsChan
  • Calls AddDocument(doc) on each one
  • Runs until the channel is closed

Documents are distributed using:

for _, doc := range docs {
    docsChan <- doc
}
close(docsChan)

The main goroutine waits for completion:

wg.Wait()

Why I used a worker pool in this project:

  • Indexing is CPU-intensive
  • Each document is independent
  • It utilizes multiple CPU cores efficiently
  • It prevents spawning unlimited goroutines
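Putting those pieces together, here is a self-contained sketch of the pattern. The Doc type and atomic counter stand in for the real Document type and index.AddDocument:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// Doc stands in for the engine's Document type.
type Doc struct{ ID int }

// indexAll fans docs out to a fixed pool of workers and waits for
// them all to finish; it returns how many documents were processed.
func indexAll(docs []Doc, workers int) int64 {
	docsChan := make(chan Doc, workers)
	var indexed int64 // stand-in for the real index

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range docsChan {
				atomic.AddInt64(&indexed, 1) // "index" the document
			}
		}()
	}

	for _, d := range docs {
		docsChan <- d // distribute work
	}
	close(docsChan) // lets each worker's range loop terminate
	wg.Wait()       // block until the channel is fully drained
	return indexed
}

func main() {
	n := indexAll(make([]Doc, 100), runtime.NumCPU())
	fmt.Println("indexed:", n) // indexed: 100
}
```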

2. Mutex for thread safety

The engine uses sync.RWMutex to prevent race conditions.

e.mu.Lock() → Used during indexing (write operation)
e.mu.RLock() → Used during search (read operation)

This ensures safe concurrent searches and prevents the index from being read while indexing is modifying it.
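A minimal sketch of that locking discipline (the field names and index shape here are illustrative, not the engine's actual ones):

```go
package main

import (
	"fmt"
	"sync"
)

// Engine guards a simplified inverted index with an RWMutex.
type Engine struct {
	mu    sync.RWMutex
	index map[string][]int // term -> posting list of doc IDs
}

// AddPosting is a write: it takes the exclusive lock.
func (e *Engine) AddPosting(term string, docID int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.index[term] = append(e.index[term], docID)
}

// Lookup is a read: many Lookups may hold the shared lock at once.
func (e *Engine) Lookup(term string) []int {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.index[term]
}

func main() {
	e := &Engine{index: map[string][]int{}}
	e.AddPosting("golang", 42)
	fmt.Println(e.Lookup("golang")) // [42]
}
```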

GoSearch Indexing

Search query flow:

The query is processed through the Analyzer (tokenization, normalization, stemming), relevant postings are retrieved from the inverted index, TF-IDF scores are computed for matching documents, results are ranked in descending order, and the top-k documents are returned as a structured JSON response.


TF-IDF Ranking Algorithm

GoSearch uses the TF-IDF (Term Frequency - Inverse Document Frequency) model for keyword-based relevance ranking.

type Posting struct {
    DocID         int
    TermFrequency int  // Used for TF
    DocLength     int  // Used for normalization
}

Term Frequency (TF)

Measures how often a term appears in a document.
Higher frequency ⇒ higher importance within that document.

Inverse Document Frequency (IDF)

Measures how rare a term is across the corpus.
Rare terms receive higher weight, while common terms (e.g., stop words) receive lower weight.

The scoring formula is:

score(term, document) = TF × IDF

The key insights were:

  • Terms appearing in all documents receive very low (or zero) IDF weight.
  • Rare, discriminative terms contribute more to ranking.
  • Documents containing more query-specific terms rank higher.

In the example below (3 small documents), the word "good" appears in all documents and therefore gets minimal ranking weight, while "boy" and "girl" provide stronger ranking signals.

TF-IDF example on 3 documents

View the deployed project here


Thanks for reading till here! I’ll continue updating this README with more technical details and deployment steps soon.

About

GoSearch is a lightweight, concurrent full-text search engine built from scratch in Go with the Gin web framework; it indexes large document collections and returns relevance-ranked results.
