GlobalLinks Project

GlobalLinks is a backlink gathering tool based on the Common Crawl dataset. It's currently in alpha and under active development.

Features

  • Multithreaded processing of links.
  • Parses up to 300,000 pages per minute per thread.

Configuration

Control the number of threads with the GLOBALLINKS_MAXTHREADS environment variable:

export GLOBALLINKS_MAXTHREADS=4

Control the number of WAT files parsed in one batch with the GLOBALLINKS_MAXWATFILES environment variable:

export GLOBALLINKS_MAXWATFILES=10

Set the path for data files (default: "data") with the GLOBALLINKS_DATAPATH environment variable:

export GLOBALLINKS_DATAPATH=data

Usage

Start by selecting an archive and its segment name from Common Crawl: https://www.commoncrawl.org/get-started. Then run the following command:

go run cmd/importer/main.go [archive_name] [num_files_to_process] [num_threads] [num_segments]

go run cmd/importer/main.go CC-MAIN-2021-04 900 4 0-10

Replace CC-MAIN-2021-04 with your chosen archive name. One segment contains up to 1,000 files, num_threads is the number of processor threads to use, and num_segments is the segment to import or a range of segments (for example 10, or 5-10). There are 100 segments in one archive.
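The segment argument accepts either a single value, a comma-separated list, or a range. As an illustration only (the importer does this parsing internally; this sketch is not project code), a range like 5-10 expands as follows:

```shell
# Illustration only: how a segment argument such as "5-10" expands.
# The importer parses this argument internally; this is not project code.
range="5-10"
case "$range" in
  *-*) start=${range%-*}; end=${range#*-}; segments=$(seq "$start" "$end") ;;
  *)   segments=$(printf '%s' "$range" | tr ',' ' ') ;;
esac
echo $segments
```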

Distribute the backlink data into a tree directory structure so that an API can be built on top of it:

go run cmd/storelinks/main.go data/links/compact_0.txt.gz data/linkdb

Replace data/links/compact_0.txt.gz with your chosen compacted links file and data/linkdb with your chosen output directory. Repeating this command for all compacted segment links files updates the tree directory structure in data/linkdb.
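Repeating the command for every compacted file can be scripted. A minimal sketch, assuming the files follow the compact_<segment>.txt.gz naming seen above (the echo prints each command as a dry run; remove it to actually execute):

```shell
# Dry run: print the storelinks command for each compacted segment file.
# Assumes files named data/links/compact_<segment>.txt.gz; remove `echo` to run.
for f in data/links/compact_*.txt.gz; do
  [ -e "$f" ] || continue   # glob matched nothing; no compacted files yet
  echo go run cmd/storelinks/main.go "$f" data/linkdb
done
```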

Links files can also be compacted into one file manually at a later time:

go run cmd/storelinks/main.go compacting data/links/sort_50.txt.gz data/links/compact_50.txt.gz

Test settings

The wat.go file contains the line const debugTestMode = false. Setting it to true imports only 10 files from each of 3 segments, letting you watch the whole process on a limited data set: 30 files for the test instead of 90,000.

Output

Links files are stored in data/links/.

Pages files are stored in data/pages/.

Format

link: linkedDomain|linkedSubdomain|linkedPath|linkedQuery|linkedScheme|sourceHost|sourcePath|sourceQuery|sourceScheme|linkText|nofollow|noindex|date_imported|ip

page: sourceHost|sourcePath|sourceQuery|sourceScheme|pageTitle|ip|date_imported|internal_links_qty|external_links_qty|noindex

Docker compose

Build the Docker image and collect data from the archive CC-MAIN-2021-04 using 6 WAT files and 4 threads. There are 90,000 files to collect in one archive, 720 files per segment.

make compose-up ARCHIVENAME="CC-MAIN-2021-04" GLOBALLINKS_MAXWATFILES=6 GLOBALLINKS_MAXTHREADS=4

Data will be stored in the watdata directory. You can restart the process multiple times; it will continue from the last file.

Docker

Build the Docker image:

make docker-build

Example of using the Docker image to collect data from the archive CC-MAIN-2021-04 with 6 WAT files and 4 threads:

ARCHIVENAME='CC-MAIN-2021-04' GLOBALLINKS_MAXWATFILES='6' GLOBALLINKS_MAXTHREADS='4' make docker-start

Data will also be stored in the watdata directory.

Stop the running Docker container:

make docker-stop

Docker Image Availability

The Docker image is available on Docker Hub.

Parameters Description

  • Archive Name: CC-MAIN-2021-04 - Name of the archive to be parsed.
  • Number of Files: 4 - Number of files to be parsed. Currently, there are 90,000 files in one archive, with 900 in each segment. Parsing at least one segment is necessary to obtain compacted results.
  • Number of Threads: 2 - Number of threads to use (ranging from 1 to 16).
  • Segment IDs: 2 or 0-10 - Segments to import. IDs range from 0 to 99; formats like 2,3,4,5 or 2-5 are accepted.

Resource Utilization and Performance

  • Memory Usage: One thread typically consumes approximately 1.5 GB of RAM, so running 4 threads requires about 6 GB of RAM. 4 GB of RAM is the minimum requirement.
  • Processing Time: The time taken to process one segment varies depending on CPU and network speed. It generally takes a few hours.
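The RAM figure scales linearly with the thread count. A quick back-of-the-envelope check using the ~1.5 GB (1500 MB) per-thread figure above:

```shell
# Rough RAM estimate: ~1.5 GB (1500 MB) per parsing thread
threads=4
ram_mb=$((threads * 1500))
echo "${threads} threads -> ~${ram_mb} MB RAM"
```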

Data Storage

Data will be stored in the watdata directory:

  • links: Contains the final parsed segment links.
  • pages: Includes the final parsed segment pages.
  • tmp/links: Temporary storage for parsed segment link files.
  • tmp/pages: Temporary storage for parsed segment page files.
  • tmp/wat: Temporary storage for downloaded WAT files to be parsed.

MongoDB

Final data will be stored in MongoDB. The database name is linkdb and the collection name is links.

Storage config in /etc/mongodb.conf:

storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      directoryForIndexes: true
    collectionConfig:
      blockCompressor: zlib

This enables compression for the database and a separate directory for indexes. Every segment requires approximately 1.6 GB of space for the collection and 300 MB for the indexes.
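Using the per-segment figures above (1.6 GB for the collection plus 300 MB for indexes, roughly 1.9 GB per segment), total MongoDB disk usage can be estimated before importing:

```shell
# Disk estimate from the figures above: 1600 MB collection + 300 MB indexes per segment
segments=10
disk_mb=$((segments * (1600 + 300)))
echo "${segments} segments -> ~${disk_mb} MB of MongoDB storage"
```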

LinkDB API

The LinkDB API provides access to the collected backlink data via HTTP endpoints.

Starting the API Server

Without Authentication (Local Development)

go run cmd/linksapi/main.go localhost 27017 linkdb

With MongoDB Authentication

export MONGO_USERNAME=your_username
export MONGO_PASSWORD=your_password
export MONGO_AUTH_DB=admin
go run cmd/linksapi/main.go mongodb_host 27017 database_name

Environment Variables

  • MONGO_USERNAME: MongoDB username (optional)
  • MONGO_PASSWORD: MongoDB password (optional)
  • MONGO_AUTH_DB: Authentication database (optional, defaults to "admin")
  • GO_ENV: Set to "production" for HTTPS on port 8443, otherwise HTTP on port 8010

API Endpoints

  • Health Check: GET /api/health
  • Get Domain Links: POST /api/links

Usage Examples

# Health check
curl -X GET http://localhost:8010/api/health

# Get backlinks for a domain
curl -X POST http://localhost:8010/api/links \
  -H "Content-Type: application/json" \
  -d '{"domain": "example.com", "limit": 10}'

# Get backlinks with IP filter
curl -X POST http://localhost:8010/api/links \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "filters": [
      {
        "name": "IP",
        "val": "192.168.1.1",
        "kind": "exact"
      }
    ]
  }'

Available Filters

  • No Follow: Filter by nofollow attribute
  • Link Path: Filter by target link path
  • Source Host: Filter by source page hostname
  • Source Path: Filter by source page path
  • Anchor: Filter by anchor text
  • IP: Filter by IP address

For complete API documentation, see LINKDB.md.

🐳 Docker Deployment

The LinkDB API is available as a Docker container with automatic builds on version tags.

πŸ” Authentication Required

Since this is a private repository, authenticate with GitHub Container Registry:

# Login to GitHub Container Registry
docker login ghcr.io
# Username: your-github-username
# Password: your-github-personal-access-token (with read:packages scope)

πŸ“¦ Available Images

# Latest stable release
ghcr.io/kris-dev-hub/globallinks-linksapi:latest

# Specific versions
ghcr.io/kris-dev-hub/globallinks-linksapi:v1.0.0

πŸš€ Quick Deploy

Standalone API (No Authentication):

docker pull ghcr.io/kris-dev-hub/globallinks-linksapi:latest

docker run -d \
  --name linksapi \
  -p 8010:8010 \
  -e MONGO_HOST=your_mongo_host \
  -e MONGO_PORT=27017 \
  -e MONGO_DATABASE=linkdb \
  ghcr.io/kris-dev-hub/globallinks-linksapi:latest

With MongoDB Authentication:

docker run -d \
  --name linksapi \
  -p 8010:8010 \
  -e MONGO_HOST=your_mongo_host \
  -e MONGO_PORT=27017 \
  -e MONGO_DATABASE=linkdb \
  -e MONGO_USERNAME=your_username \
  -e MONGO_PASSWORD=your_password \
  -e MONGO_AUTH_DB=admin \
  ghcr.io/kris-dev-hub/globallinks-linksapi:latest

Production Mode (HTTPS):

docker run -d \
  --name linksapi \
  -p 8443:8443 \
  -v /path/to/certs:/app/cert \
  -e GO_ENV=production \
  -e MONGO_HOST=your_mongo_host \
  -e MONGO_USERNAME=your_username \
  -e MONGO_PASSWORD=your_password \
  -e MONGO_DATABASE=linkdb \
  ghcr.io/kris-dev-hub/globallinks-linksapi:latest

πŸ”§ Environment Variables

  • MONGO_HOST: MongoDB hostname (default: localhost)
  • MONGO_PORT: MongoDB port (default: 27017)
  • MONGO_DATABASE: Database name (default: linkdb)
  • MONGO_USERNAME: MongoDB username (optional)
  • MONGO_PASSWORD: MongoDB password (optional)
  • MONGO_AUTH_DB: Authentication database (default: admin)
  • GO_ENV: Environment, development or production (default: development)

None of these variables are required.

πŸ“‹ Connecting to External MongoDB

The LinkDB API container is designed to connect to external MongoDB instances:

Connect to existing MongoDB:

# Connect to MongoDB running on host
docker run -d \
  --name linksapi \
  -p 8010:8010 \
  -e MONGO_HOST=192.168.1.105 \
  -e MONGO_USERNAME=your_username \
  -e MONGO_PASSWORD=your_password \
  -e MONGO_DATABASE=linksdb \
  -e MONGO_AUTH_DB=admin \
  ghcr.io/kris-dev-hub/globallinks-linksapi:latest

Connect to cloud MongoDB:

# Connect to MongoDB Atlas or other cloud providers
docker run -d \
  --name linksapi \
  -p 8010:8010 \
  -e MONGO_HOST=cluster0.mongodb.net \
  -e MONGO_PORT=27017 \
  -e MONGO_USERNAME=dbuser \
  -e MONGO_PASSWORD=dbpass \
  -e MONGO_DATABASE=linkdb \
  -e MONGO_AUTH_DB=admin \
  ghcr.io/kris-dev-hub/globallinks-linksapi:latest

πŸ“± Container Features

  • Minimal size: ~15-25MB Alpine-based image
  • Memory efficient: ~10-30MB RAM usage
  • Security: Non-root user execution
  • Health checks: Built-in health endpoint monitoring
  • Multi-environment: Supports both development and production modes

πŸ—οΈ Build Locally

# Build the Docker image locally
docker build -t linksapi .

# Run locally built image
docker run -d -p 8010:8010 \
  -e MONGO_HOST=localhost \
  -e MONGO_DATABASE=linkdb \
  linksapi

πŸ“‹ Versioning and Releases

The project uses semantic versioning with automatic Docker image builds:

Creating a Release:

# Tag a new version (triggers automated build)
git tag v1.0.0
git push origin v1.0.0

# Images are automatically built and pushed to:
# ghcr.io/kris-dev-hub/globallinks-linksapi:v1.0.0
# ghcr.io/kris-dev-hub/globallinks-linksapi:latest

☸️ Kubernetes Deployment

Example Kubernetes Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: linksapi
spec:
  replicas: 2
  selector:
    matchLabels:
      app: linksapi
  template:
    metadata:
      labels:
        app: linksapi
    spec:
      containers:
      - name: linksapi
        image: ghcr.io/kris-dev-hub/globallinks-linksapi:latest
        ports:
        - containerPort: 8010
        env:
        - name: MONGO_HOST
          value: "mongodb-service"
        - name: MONGO_USERNAME
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: username
        - name: MONGO_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mongo-secret
              key: password
        - name: MONGO_DATABASE
          value: "linkdb"
        resources:
          requests:
            memory: "32Mi"
            cpu: "10m"
          limits:
            memory: "64Mi"
            cpu: "100m"
---
apiVersion: v1
kind: Service
metadata:
  name: linksapi-service
spec:
  selector:
    app: linksapi
  ports:
  - port: 80
    targetPort: 8010
  type: ClusterIP

Example

docker pull krisdevhub/globallinks:latest   
docker run --name globallinks-test -d -v ./watdata:/app/data krisdevhub/globallinks:latest /app/importer CC-MAIN-2021-04 4 2

At the end you can also set the segments you want to import. IDs range from 0 to 99; formats like 2,3,4,5 or 2-5 are accepted.

Data

Example record from a links file:

blogmedyczny.edu.pl||/czasopisma-kobiece-i-tabletki-na-odchudzanie/||2|turysta24.pl|/tabletki-odchudzajace-moga-pomoc-zredukowac-wage/||2|Theme Palace|0|0|2023-02-04|51.75.43.178

LinkedDomain|LinkedSubdomain|LinkedPath|LinkedQuery|LinkedScheme|PageHost|PagePath|PageQuery|PageScheme|LinkText|NoFollow|NoIndex|DateImported|IP
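A record can be split on the | separator to recover the fields above. A minimal sketch in plain shell, using a record in the 14-field layout (the variable names mirror the format line):

```shell
# Split one link record into its 14 pipe-separated fields
line='blogmedyczny.edu.pl||/czasopisma-kobiece-i-tabletki-na-odchudzanie/||2|turysta24.pl|/tabletki-odchudzajace-moga-pomoc-zredukowac-wage/||2|Theme Palace|0|0|2023-02-04|51.75.43.178'
IFS='|' read -r dom sub path query scheme shost spath squery sscheme text nofollow noindex date ip <<EOF
$line
EOF
echo "linked domain: $dom | source host: $shost | anchor: $text | imported: $date"
```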

There are around 6 billion unique external backlinks per month in the Common Crawl data, and the application is able to analyze and collect them all.

System Requirements

  • Go 1.21 or later.
  • 4 GB of RAM is the minimum requirement; parsing requires about 1.5 GB of RAM for every additional thread.
  • Minimum 50 GB of free disk space for every segment parsed at the same time.
  • MongoDB requires a minimum of 2 GB of disk space per segment; 200 MB of RAM for every imported segment is optimal.
  • lzop installed on the system.

Alpha Version Disclaimer

This is an alpha version of GlobalLinks and is subject to changes. The software is provided "as is", without warranty of any kind.
