ArshTiwari2004/go-text-search-engine

GoSearch is a fast full-text search engine built from scratch in Go. It indexes documents using an inverted index and ranks search results with TF-IDF, returning relevant results in milliseconds. Designed for speed, it supports concurrent indexing, persistent storage for quick startup, and a RESTful API, making it an efficient, cost-free alternative to heavier search systems like Elasticsearch for datasets under 10 million documents.

For testing, I used the simplewiki-latest-pages-articles.xml.bz2 dump file from https://dumps.wikimedia.org/simplewiki/latest/. You can download any Wikipedia dump from there and use it for testing. If you're using a different dump file, update the default path in main.go:

flag.StringVar(&dumpPath, "dump", "simplewiki-latest-pages-articles.xml.bz2",
    "Path to Wikipedia dump file")

or pass your own dump file path using the -dump flag when running the program:

go run cmd/main.go -dump path/to/your-dump-file.xml.bz2

Installation

Prerequisites

  • Go 1.21+ (Download)
  • Git
  • 4GB+ RAM (indexing large dumps may hang with less)

Quick Start

# Clone the repository
git clone https://github.com/ArshTiwari2004/go-text-search-engine.git
cd gosearch

# Download dependencies
go mod download

# Download Wikipedia dump (optional as you can use your own data)
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

# Build the project
go build -o gosearch ./cmd/api

# Run the server
./gosearch -dump simplewiki-latest-pages-articles.xml.bz2 -port 8080

Set up the backend and run the Go server

git clone https://github.com/ArshTiwari2004/go-text-search-engine.git
cd gosearch
cd cmd/api

# Place dataset here:
# simplewiki-latest-pages-articles.xml.bz2

go run main.go

The API will start at http://localhost:8080, and you will see the message "Starting GoSearch API Server".

If you want the full system, with Redis and Postgres included, run this instead:

go run main.go \
  -redis localhost:6379 \
  -dbdsn "postgres://gosearch:gosearch@localhost:5432/gosearch?sslmode=disable"

Set up the frontend (optional)

cd frontend

# Install dependencies
npm install

# Start development server
npm start

# Frontend will open at http://localhost:5173

Usage

Command Line

# First run, builds index
./gosearch -dump wiki-dump.xml.gz

# Subsequent runs, it loads from disk
./gosearch

# Force rebuild
./gosearch -rebuild

Programmatic Usage

package main

import (
    "fmt"

    "github.com/ArshTiwari2004/gosearch/internal/engine"
)

func main() {
    // Create engine
    eng := engine.NewEngine()
    
    // Load documents
    docs, _ := engine.LoadDocuments("dump.xml.gz")
    
    // Build index
    eng.IndexDocuments(docs)
    
    // Search
    results, _ := eng.Search("golang concurrency", 10)
    
    for _, result := range results {
        fmt.Printf("%s (score: %.3f)\n", result.Document.Title, result.Score)
    }
}

Modern applications require search functionality, but existing solutions have limitations:

Solution              Problem
--------------------  -----------------------------------------------------------
Elasticsearch         Expensive ($$$), complex setup, overkill for <10M docs
Algolia               Vendor lock-in, expensive at scale ($2K+/month)
Built-in SQL LIKE     Doesn't scale beyond 100K records, no relevance ranking
strings.Contains()    O(n) per search, no ranking, impractical for large datasets

Configuration Options

These are the configuration options provided in main.go; more will be added over time.

Command-line flags:

./gosearch [options]

Options:
  -dump string
        Path to Wikipedia XML dump file
        (default "enwiki-latest-stub-articles.xml.gz")
  
  -data string
        Directory for index persistence
        (default "./data")
  
  -port string
        HTTP server port
        (default "8080")
  
  -rebuild
        Force rebuild index from dump (ignores persisted index)
        (default false)

Examples

# Use custom dump file
./gosearch -dump my-documents.xml.gz

# Use different port
./gosearch -port 3000

# Force rebuild (useful after code changes)
./gosearch -rebuild

# All options combined
./gosearch -dump data.xml.gz -port 9000 -data /var/lib/gosearch -rebuild

Features available in GoSearch:

Core Search Engine

  • Inverted Index - Maps terms to documents for fast lookups
  • TF-IDF Ranking - Relevance scoring based on term frequency and inverse document frequency
  • Text Analysis Pipeline
    • Tokenization (split on word boundaries)
    • Lowercasing (case-insensitive search)
    • Stopword removal (filter common words)
    • Snowball stemming (reduce to root forms)
  • Boolean AND Queries - Find documents containing all query terms
  • Ranked Results - Sort by relevance score
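The text-analysis pipeline above can be sketched as a single function. This is a simplified stand-in: the stopword list here is tiny and the suffix stripping is only illustrative, since the real engine uses a Snowball stemmer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// analyze runs a simplified version of the pipeline:
// tokenize on word boundaries, lowercase, drop stopwords, stem.
func analyze(text string) []string {
	stop := map[string]bool{"the": true, "a": true, "is": true, "are": true, "of": true}
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	var out []string
	for _, tok := range tokens {
		if stop[tok] {
			continue // stopword removal
		}
		// Naive suffix stripping; a real Snowball stemmer is far more careful.
		tok = strings.TrimSuffix(tok, "ing")
		tok = strings.TrimSuffix(tok, "s")
		out = append(out, tok)
	}
	return out
}

func main() {
	fmt.Println(analyze("The cats are running")) // [cat runn]
}
```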

Performance Optimizations

  • Concurrent Indexing - Worker pool pattern for parallel processing
  • Persistent Storage - Save/load index to avoid rebuild (85% startup time reduction)
  • Memory Efficiency - Optimized data structures
  • Posting List Intersection - Efficient merge algorithm (O(n+m))
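The posting-list intersection above is a classic two-pointer merge over sorted doc-ID lists; a minimal sketch:

```go
package main

import "fmt"

// intersect returns doc IDs present in both sorted posting lists,
// advancing two cursors in a single pass: O(n+m).
func intersect(a, b []int) []int {
	out := []int{}
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++ // a is behind; advance it
		default:
			j++ // b is behind; advance it
		}
	}
	return out
}

func main() {
	fmt.Println(intersect([]int{1, 3, 5, 7}, []int{3, 4, 5, 8})) // [3 5]
}
```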

API & Integration

  • RESTful API - JSON endpoints with Gin framework
  • CORS Support - Enable frontend integration
  • Statistics Endpoint - Real-time performance metrics
  • Health Checks - Monitoring and alerting support
  • Documentation - OpenAPI/Swagger compatible

Developer Experience

  • Clean Architecture - Separation of concerns
  • Comprehensive Comments - Documented each function cleanly
  • Error Handling - Proper error propagation
  • Type Safety - Strongly typed throughout

Multi-Language Support

GoSearch is built in Go, but it is not limited to Go applications. Because GoSearch exposes a RESTful HTTP API with JSON responses, it can be used from any programming language that supports HTTP requests.

This makes GoSearch fully language agnostic, similar to how Elasticsearch works internally.

GoSearch runs as an HTTP server:

GoSearch Engine (Go)
↓
Gin HTTP Server → REST API (JSON)

Any language capable of sending HTTP requests can integrate with it.

Python example

import requests

response = requests.post(
    "http://localhost:8080/api/v1/search",
    json={
        "query": "golang concurrency",
        "max_results": 5
    }
)

print(response.json())

NodeJS example

const axios = require("axios");

async function search() {
  const response = await axios.post(
    "http://localhost:8080/api/v1/search",
    {
      query: "golang concurrency",
      max_results: 5
    }
  );

  console.log(response.data);
}

search();

Java example

// Using Java 11+ HttpClient
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create("http://localhost:8080/api/v1/search"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(
        "{\"query\":\"golang\",\"max_results\":5}"
    ))
    .build();

In the future, official client SDKs may be provided for: gosearch-go, gosearch-py, gosearch-js.

These SDKs would simply wrap the REST API to provide a cleaner developer experience.

High-Level System Architecture

graph TB
    subgraph "Client Layer"
        A[Web Browser]
        B[Mobile App]
        C[CLI Tool]
    end
    
    subgraph "Frontend Layer"
        D[React UI Search Interface]
    end
    
    subgraph "API Gateway"
        E[Nginx]
    end
    
    subgraph "Application Layer"
        F[REST API Server - Gin Framework]
        G[Search Engine Core]
        H[Text Analyzer]
        I[Ranking Engine TF-IDF]
    end
    
    subgraph "Data Layer"
        J[Inverted index in memory]
        K[Document store in memory]
        L[Persistence layer - Gob files]
    end
    
    subgraph "External Data"
        M[Wikipedia Dump - XML.GZ]
    end
    
    A & B & C --> D
    D --> E
    E --> F
    F --> G
    G --> H
    G --> I
    G --> J
    G --> K
    J -.persist.-> L
    K -.persist.-> L
    M -.load.-> G
    
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style F fill:#e8f5e9
    style G fill:#fce4ec
    style J fill:#f3e5f5
    style L fill:#e0f2f1

While testing with the simplewiki-latest-pages-articles.xml.bz2 dump, the search engine had 1,000 documents indexed (a limit set intentionally) with 52,366 unique terms. When searching for "go programming", the frontend sends a POST request to /api/v1/search and the Go backend performs TF-IDF ranking to return the top results. The query returned 20 results in ~1.78 ms, demonstrating very fast, low-latency performance.

API Documentation

Base URL

http://localhost:8080/api/v1

Endpoints

1. Search (POST)

Endpoint: POST /api/v1/search

Request:

{
  "query": "golang concurrency patterns",
  "max_results": 10,
  "min_score": 0.5
}

Response:

{
  "query": "golang concurrency patterns",
  "results": [
    {
      "document": {
        "id": 12345,
        "title": "Go Concurrency Patterns",
        "url": "https://...",
        "text": "...",
        "word_count": 500
      },
      "score": 8.45,
      "snippets": ["...concurrency patterns in Go..."],
      "rank": 1
    }
  ],
  "total_results": 15,
  "time_taken": "23.5ms",
  "success": true
}

2. Search (GET)

Endpoint: GET /api/v1/search?q=golang&limit=10

Response: Same as POST

3. Get Document

Endpoint: GET /api/v1/document/:id

Response:

{
  "document": {
    "id": 12345,
    "title": "Document Title",
    "text": "Full document text...",
    "url": "https://..."
  },
  "success": true
}

4. Statistics

Endpoint: GET /api/v1/stats

Response:

{
  "total_documents": 600000,
  "total_terms": 2500000,
  "total_queries": 15234,
  "average_query_time": "45.2ms",
  "memory_usage_mb": 450.3,
  "index_size_kb": 102400,
  "uptime": "5h23m"
}

5. Health Check

Endpoint: GET /health

Response:

{
  "status": "healthy",
  "documents": 600000,
  "terms": 2500000,
  "queries": 15234,
  "timestamp": 1640000000
}

Concurrent Indexing Flow

Concurrent indexing uses a worker pool pattern where N workers (based on CPU cores) process documents in parallel, build local indices without locks, then merge at the end for a 1.9x speedup.

A worker pool is a concurrency pattern where:

  • we create N worker goroutines
  • send them tasks through a channel
  • each worker picks up tasks and processes them
  • we wait until all workers finish

Instead of doing work one item at a time, we divide it across multiple workers running in parallel. In this code, each "task" is a document to index.

// Create worker pool
workers := runtime.NumCPU()
docsChan := make(chan Document, workers)

var wg sync.WaitGroup

// Start workers
for i := 0; i < workers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for doc := range docsChan {
            index.AddDocument(doc)  // Process in parallel
        }
    }()
}

// Send work
for _, doc := range documents {
    docsChan <- doc
}
close(docsChan) // let workers' range loops finish
wg.Wait()       // block until every document is indexed

Concurrency Model

I have used two types of concurrency:

1. Worker Pool for parallel indexing

Indexing uses a worker pool pattern to process documents concurrently.

for i := 0; i < numWorkers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for doc := range docsChan {
            e.index.AddDocument(doc)
        }
    }()
}

It works in this way:

  • Start numWorkers goroutines
  • Each worker waits for documents from docsChan
  • Calls AddDocument(doc) on each one
  • Runs until the channel is closed

Documents are distributed using:

for _, doc := range docs {
    docsChan <- doc
}
close(docsChan)

The main goroutine waits for completion:

wg.Wait()

Why I used a worker pool in this project:

  • Indexing is CPU-intensive
  • Each document is independent
  • It utilizes multiple CPU cores efficiently
  • It prevents spawning unlimited goroutines
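Putting those pieces together, here is a self-contained sketch of the pattern. The Doc type and atomic counter stand in for the real Document type and index.AddDocument:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// Doc stands in for the engine's Document type.
type Doc struct{ ID int }

// indexAll fans docs out to a fixed pool of workers and waits for
// them all to finish; it returns how many documents were processed.
func indexAll(docs []Doc, workers int) int64 {
	docsChan := make(chan Doc, workers)
	var indexed int64 // stand-in for the real index

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range docsChan {
				atomic.AddInt64(&indexed, 1) // "index" the document
			}
		}()
	}

	for _, d := range docs {
		docsChan <- d // distribute work
	}
	close(docsChan) // lets each worker's range loop terminate
	wg.Wait()       // block until the channel is fully drained
	return indexed
}

func main() {
	n := indexAll(make([]Doc, 100), runtime.NumCPU())
	fmt.Println("indexed:", n) // indexed: 100
}
```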

2. Mutex for thread safety

The engine uses sync.RWMutex to prevent race conditions.

e.mu.Lock() → Used during indexing (write operation)
e.mu.RLock() → Used during search (read operation)

This ensures safe concurrent searches and prevents the index from being read while indexing is modifying it.
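A minimal sketch of that locking discipline (the field names and index shape here are illustrative, not the engine's actual ones):

```go
package main

import (
	"fmt"
	"sync"
)

// Engine guards a simplified inverted index with an RWMutex.
type Engine struct {
	mu    sync.RWMutex
	index map[string][]int // term -> posting list of doc IDs
}

// AddPosting is a write: it takes the exclusive lock.
func (e *Engine) AddPosting(term string, docID int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.index[term] = append(e.index[term], docID)
}

// Lookup is a read: many Lookups may hold the shared lock at once.
func (e *Engine) Lookup(term string) []int {
	e.mu.RLock()
	defer e.mu.RUnlock()
	return e.index[term]
}

func main() {
	e := &Engine{index: map[string][]int{}}
	e.AddPosting("golang", 42)
	fmt.Println(e.Lookup("golang")) // [42]
}
```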

GoSearch Indexing

Search query flow:

The query is processed through the Analyzer (tokenization, normalization, stemming), relevant postings are retrieved from the inverted index, TF-IDF scores are computed for matching documents, results are ranked in descending order, and the top-k documents are returned as a structured JSON response.


TF-IDF Ranking Algorithm

GoSearch uses the TF-IDF (Term Frequency - Inverse Document Frequency) model for keyword-based relevance ranking.

type Posting struct {
    DocID         int
    TermFrequency int  // Used for TF
    DocLength     int  // Used for normalization
}

Term Frequency (TF)

Measures how often a term appears in a document.
Higher frequency ⇒ higher importance within that document.

Inverse Document Frequency (IDF)

Measures how rare a term is across the corpus.
Rare terms receive higher weight, while common terms (e.g., stop words) receive lower weight.

The scoring formula is:

score(term, document) = TF × IDF

The key insights were:

  • Terms appearing in all documents receive very low (or zero) IDF weight.
  • Rare, discriminative terms contribute more to ranking.
  • Documents containing more query-specific terms rank higher.

In the example below (3 small documents), the word "good" appears in all documents and therefore gets minimal ranking weight, while "boy" and "girl" provide stronger ranking signals.

TF-IDF example on 3 documents

View the deployed project here


Thanks for reading till here! I’ll continue updating this README with more technical details and deployment steps soon.

About

GoSearch is a lightweight, concurrent full-text search engine built from scratch in Go with the Gin web framework; it indexes large document collections and returns relevance-ranked results.
