A production-ready, high-performance metrics collector service written in Go that collects system and application metrics and ships them to remote endpoints with enterprise-grade security.
🚀 Features: System metrics (CPU, Memory, Disk, Network) • GPU monitoring (NVIDIA) • Application endpoint scraping • TLS/mTLS support • Prometheus & HTTP JSON shipping • Docker & Kubernetes ready
- Features
- Quick Start
- Architecture
- Installation
- Configuration
- Usage
- Shipper Types
- TLS Configuration
- Collected Metrics
- Security Considerations
- Deployment
- Performance Tuning
- Development
- Troubleshooting
- FAQ
- Contributing
- License
Get metricsd up and running in 5 minutes:
# Clone and build
git clone https://github.com/0x524A/metricsd.git
cd metricsd
go build -o bin/metricsd cmd/metricsd/main.go
# Create configuration
cp config.example.json config.json
# Edit config.json to set your endpoint
# For example, change endpoint to your Prometheus or metrics collector URL
# Run the service
./bin/metricsd -config config.json
# Check health
curl http://localhost:8080/health
With TLS:
# Generate self-signed certificates (for testing)
mkdir -p certs && cd certs
openssl req -x509 -newkey rsa:4096 -keyout client.key -out client.crt -days 365 -nodes \
-subj "/CN=metricsd-client"
cd ..
# Update config.json to enable TLS
# Set shipper.tls.enabled to true
# Set certificate paths in shipper.tls section
# Run with TLS
./bin/metricsd -config config.json
With Docker:
docker build -t metricsd:latest .
docker run -d -p 8080:8080 -v $(pwd)/config.json:/etc/metricsd/config.json:ro metricsd:latest
- Comprehensive Metrics Collection
- CPU usage (per-core and total utilization)
- Memory usage (RAM and swap statistics)
- Disk I/O and usage statistics
- Network I/O statistics
- GPU metrics via NVIDIA NVML (optional)
- Custom application endpoint scraping
- Application Metrics Collection
- HTTP endpoint scraping for application metrics
- Support for multiple application endpoints
- JSON-based metrics format
- Configurable timeout and retry logic
- Flexible Shipping Options
- Prometheus Remote Write protocol with Snappy compression
- HTTP JSON POST
- Advanced TLS/SSL support for secure transmission
- Configurable request timeouts
- Enterprise-Grade Security
- Full TLS 1.2/1.3 support with custom configuration
- Client certificate authentication (mTLS)
- Custom CA certificate support
- Configurable cipher suites
- SNI (Server Name Indication) support
- TLS version pinning (min/max)
- Session ticket management
- Optional certificate verification bypass for testing
- Configurable & Extensible
- JSON configuration with environment variable overrides
- Adjustable collection intervals
- Enable/disable specific metric collectors
- Health endpoint for monitoring
- Flexible shipper interface for custom backends
- Production-Ready
- Structured logging with zerolog
- Graceful shutdown with cleanup
- Error handling and resilience
- SOLID design principles
- Resource cleanup and leak prevention
The service follows SOLID principles with a clean architecture:
metrics-collector/
├── cmd/
│ └── metrics-collector/ # Application entry point
│ └── main.go
├── internal/
│ ├── collector/ # Metric collectors (System, GPU, HTTP)
│ │ ├── collector.go # Collector interface and registry
│ │ ├── system.go # OS metrics collector
│ │ ├── gpu.go # GPU metrics collector
│ │ └── http.go # HTTP endpoint scraper
│ ├── config/ # Configuration management
│ │ └── config.go
│ ├── shipper/ # Metrics shipping
│ │ ├── shipper.go # Shipper interface
│ │ ├── prometheus.go # Prometheus remote write
│ │ └── http_json.go # HTTP JSON shipper
│ ├── orchestrator/ # Collection orchestration
│ │ └── orchestrator.go
│ └── server/ # HTTP server for health checks
│ └── server.go
├── config.example.json # Example configuration
├── go.mod
├── go.sum
└── README.md
- Go 1.24 or later
- NVIDIA drivers and CUDA (optional, for GPU metrics)
# Clone the repository
git clone https://github.com/jainri3/metrics-collector.git
cd metrics-collector
# Download dependencies
go mod download
# Build the binary
go build -o bin/metrics-collector cmd/metrics-collector/main.go
Create a config.json file based on the example:
cp config.example.json config.json
{
"server": {
"host": "0.0.0.0",
"port": 8080
},
"collector": {
"interval_seconds": 60,
"enable_cpu": true,
"enable_memory": true,
"enable_disk": true,
"enable_network": true,
"enable_gpu": false
},
"shipper": {
"type": "http_json",
"endpoint": "https://collector.example.com:9090/api/v1/metrics",
"timeout": 30000000000,
"tls": {
"enabled": true,
"cert_file": "/path/to/client-cert.pem",
"key_file": "/path/to/client-key.pem",
"ca_file": "/path/to/ca.pem",
"insecure_skip_verify": false,
"server_name": "collector.example.com",
"min_version": "TLS1.2",
"max_version": "TLS1.3",
"cipher_suites": [
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
],
"session_tickets": true
}
},
"endpoints": [
{
"name": "app1",
"url": "http://localhost:3000/metrics"
}
]
}
| Field | Description | Default |
|---|---|---|
| `server.host` | HTTP server bind address | `0.0.0.0` |
| `server.port` | HTTP server port | `8080` |
| `collector.interval_seconds` | Collection interval in seconds | `60` |
| `collector.enable_cpu` | Enable CPU metrics collection | `true` |
| `collector.enable_memory` | Enable memory metrics collection | `true` |
| `collector.enable_disk` | Enable disk metrics collection | `true` |
| `collector.enable_network` | Enable network metrics collection | `true` |
| `collector.enable_gpu` | Enable GPU metrics collection (requires NVIDIA GPU) | `false` |
| `shipper.type` | Shipper type: `prometheus_remote_write` or `http_json` | - |
| `shipper.endpoint` | Remote endpoint URL | - |
| `shipper.timeout` | Request timeout in nanoseconds | `30000000000` (30s) |
| `shipper.tls.enabled` | Enable TLS/SSL | `false` |
| `shipper.tls.cert_file` | Path to client certificate file (PEM) | - |
| `shipper.tls.key_file` | Path to client private key file (PEM) | - |
| `shipper.tls.ca_file` | Path to CA certificate file for server verification | - |
| `shipper.tls.insecure_skip_verify` | Skip server certificate verification (not recommended) | `false` |
| `shipper.tls.server_name` | Server name for SNI (overrides hostname from endpoint) | - |
| `shipper.tls.min_version` | Minimum TLS version: `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` | `TLS1.2` |
| `shipper.tls.max_version` | Maximum TLS version: `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` | `TLS1.3` |
| `shipper.tls.cipher_suites` | Array of allowed cipher suites (see Cipher Suites section) | System defaults |
| `shipper.tls.session_tickets` | Enable TLS session ticket resumption | `true` |
| `endpoints` | Array of application HTTP endpoints to scrape | `[]` |
You can override configuration values using environment variables:
| Environment Variable | Description | Example |
|---|---|---|
| `MC_SERVER_HOST` | Server bind address | `0.0.0.0` |
| `MC_SERVER_PORT` | Server port number | `8080` |
| `MC_COLLECTOR_INTERVAL` | Collection interval in seconds | `60` |
| `MC_SHIPPER_TYPE` | Shipper type | `prometheus_remote_write` |
| `MC_SHIPPER_ENDPOINT` | Shipper endpoint URL | `https://metrics.example.com/write` |
| `MC_TLS_ENABLED` | Enable TLS | `true` |
| `MC_TLS_CERT_FILE` | Client certificate file path | `/etc/metricsd/certs/client.crt` |
| `MC_TLS_KEY_FILE` | Client private key file path | `/etc/metricsd/certs/client.key` |
| `MC_TLS_CA_FILE` | CA certificate file path | `/etc/metricsd/certs/ca.crt` |
| `MC_TLS_SERVER_NAME` | SNI server name | `collector.example.com` |
| `MC_TLS_MIN_VERSION` | Minimum TLS version | `TLS1.2` |
| `MC_TLS_INSECURE_SKIP_VERIFY` | Skip certificate verification | `false` |
# Run with default config.json
./bin/metrics-collector
# Run with custom config file
./bin/metrics-collector -config /path/to/config.json
# Set log level
./bin/metrics-collector -log-level debug
Log levels:
- `debug` - Detailed debugging information
- `info` - General informational messages (default)
- `warn` - Warning messages
- `error` - Error messages only
The service exposes a health endpoint:
curl http://localhost:8080/health
Response:
{
"status": "healthy",
"timestamp": "2025-11-05T12:34:56Z",
"uptime": "1h23m45s"
}
Ships metrics using the Prometheus remote write protocol with Snappy compression.
{
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "http://prometheus:9090/api/v1/write"
}
}
Ships metrics as JSON via HTTP POST.
{
"shipper": {
"type": "http_json",
"endpoint": "http://collector:8080/api/v1/metrics"
}
}
Payload format:
{
"timestamp": 1699185296,
"metrics": [
{
"name": "system_cpu_usage_percent",
"value": 45.2,
"type": "gauge",
"labels": {
"core": "0"
}
}
]
}
The service supports advanced TLS configuration for secure communication with remote endpoints. This includes mutual TLS (mTLS), custom cipher suites, and version pinning.
For simple TLS with server certificate verification:
{
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "https://metrics.example.com/api/v1/write",
"tls": {
"enabled": true,
"ca_file": "/etc/metricsd/certs/ca.pem"
}
}
}
For client certificate authentication:
{
"shipper": {
"type": "http_json",
"endpoint": "https://secure-collector.example.com/metrics",
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/client.crt",
"key_file": "/etc/metricsd/certs/client.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "secure-collector.example.com"
}
}
}
Full control over TLS parameters:
{
"shipper": {
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/client.crt",
"key_file": "/etc/metricsd/certs/client.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "metrics.internal.example.com",
"min_version": "TLS1.2",
"max_version": "TLS1.3",
"cipher_suites": [
"TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
"TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
],
"session_tickets": true,
"insecure_skip_verify": false
}
}
}
| Option | Description | Values |
|---|---|---|
| `enabled` | Enable/disable TLS | `true`, `false` |
| `cert_file` | Client certificate for mTLS | Path to PEM file |
| `key_file` | Client private key for mTLS | Path to PEM file |
| `ca_file` | CA certificate for server verification | Path to PEM file |
| `server_name` | SNI hostname override | Domain name |
| `min_version` | Minimum TLS version | `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` |
| `max_version` | Maximum TLS version | `TLS1.0`, `TLS1.1`, `TLS1.2`, `TLS1.3` |
| `cipher_suites` | Allowed cipher suites | Array of suite names |
| `session_tickets` | Enable session resumption | `true`, `false` |
| `insecure_skip_verify` | Skip certificate verification | `true`, `false` (not recommended for production) |
TLS 1.3 Cipher Suites:
- `TLS_AES_128_GCM_SHA256`
- `TLS_AES_256_GCM_SHA384`
- `TLS_CHACHA20_POLY1305_SHA256`
TLS 1.2 Cipher Suites (Recommended):
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256`
- `TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256`
Additional TLS 1.2 Cipher Suites:
- `TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA`
- `TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA`
- `TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA`
- `TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA`
- `TLS_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_RSA_WITH_AES_256_GCM_SHA384`
- `TLS_RSA_WITH_AES_128_CBC_SHA256`
- `TLS_RSA_WITH_AES_128_CBC_SHA`
- `TLS_RSA_WITH_AES_256_CBC_SHA`
Note: If cipher suites are not specified, Go's default secure cipher suite list will be used. TLS 1.3 cipher suites cannot be configured in Go and use the protocol's default settings.
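For reference, the options in this section correspond roughly to fields on Go's `crypto/tls.Config`. The sketch below is an approximation of that mapping, not the project's actual code in `internal/shipper`; as noted above, the `CipherSuites` list only affects TLS 1.2 connections.

```go
package tlsexample

import (
    "crypto/tls"
    "crypto/x509"
    "fmt"
    "os"
)

// buildTLSConfig shows roughly how the shipper.tls options translate to crypto/tls.
func buildTLSConfig(certFile, keyFile, caFile, serverName string, insecureSkipVerify, sessionTickets bool) (*tls.Config, error) {
    cfg := &tls.Config{
        MinVersion:             tls.VersionTLS12,  // min_version
        MaxVersion:             tls.VersionTLS13,  // max_version
        ServerName:             serverName,        // server_name (SNI)
        InsecureSkipVerify:     insecureSkipVerify, // insecure_skip_verify
        SessionTicketsDisabled: !sessionTickets,   // session_tickets
        // cipher_suites: honored for TLS 1.2 only; Go picks TLS 1.3 suites itself.
        CipherSuites: []uint16{
            tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
            tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
        },
    }
    if certFile != "" && keyFile != "" { // cert_file / key_file -> mTLS client certificate
        cert, err := tls.LoadX509KeyPair(certFile, keyFile)
        if err != nil {
            return nil, fmt.Errorf("load client key pair: %w", err)
        }
        cfg.Certificates = []tls.Certificate{cert}
    }
    if caFile != "" { // ca_file -> custom root CA pool for server verification
        pem, err := os.ReadFile(caFile)
        if err != nil {
            return nil, fmt.Errorf("read CA file: %w", err)
        }
        pool := x509.NewCertPool()
        if !pool.AppendCertsFromPEM(pem) {
            return nil, fmt.Errorf("no certificates found in %s", caFile)
        }
        cfg.RootCAs = pool
    }
    return cfg, nil
}
```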
- Use TLS 1.2 or higher - Set `min_version` to `TLS1.2` minimum
- Enable mTLS - Use client certificates for mutual authentication
- Verify certificates - Keep `insecure_skip_verify` as `false` in production
- Use strong cipher suites - Prefer ECDHE and AEAD ciphers
- Configure SNI - Set `server_name` when using name-based virtual hosting
- Rotate certificates - Implement a certificate rotation strategy
- Secure key storage - Protect private keys with appropriate file permissions
Generate self-signed CA:
openssl req -x509 -new -nodes -keyout ca.key -sha256 -days 1825 -out ca.crt \
-subj "/C=US/ST=State/L=City/O=Organization/CN=CA"
Generate client certificate:
# Generate private key
openssl genrsa -out client.key 2048
# Generate certificate signing request
openssl req -new -key client.key -out client.csr \
-subj "/C=US/ST=State/L=City/O=Organization/CN=metricsd-client"
# Sign with CA
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out client.crt -days 825 -sha256
Set secure file permissions:
chmod 600 /etc/metricsd/certs/*.key
chmod 644 /etc/metricsd/certs/*.crt
chown metricsd:metricsd /etc/metricsd/certs/*
Certificate verification failed:
- Ensure CA certificate includes the full chain
- Verify `server_name` matches the certificate CN or SAN
- Check certificate expiration dates
Handshake failure:
- Verify cipher suites are compatible with server
- Check TLS version compatibility (min/max versions)
- Ensure client certificate is valid and trusted by server
Enable debug logging:
./bin/metricsd -log-level debug
CPU:
- `system_cpu_usage_percent` - Per-core CPU usage
- `system_cpu_usage_total_percent` - Overall CPU usage
- `system_cpu_count` - Number of CPU cores
Memory:
- `system_memory_total_bytes` - Total memory
- `system_memory_used_bytes` - Used memory
- `system_memory_available_bytes` - Available memory
- `system_memory_usage_percent` - Memory usage percentage
- `system_swap_total_bytes` - Total swap space
- `system_swap_used_bytes` - Used swap space
- `system_swap_usage_percent` - Swap usage percentage
Disk:
- `system_disk_total_bytes` - Total disk space
- `system_disk_used_bytes` - Used disk space
- `system_disk_free_bytes` - Free disk space
- `system_disk_usage_percent` - Disk usage percentage
- `system_disk_read_bytes_total` - Total bytes read
- `system_disk_write_bytes_total` - Total bytes written
- `system_disk_read_count_total` - Total read operations
- `system_disk_write_count_total` - Total write operations
Network:
- `system_network_bytes_sent_total` - Total bytes sent
- `system_network_bytes_recv_total` - Total bytes received
- `system_network_packets_sent_total` - Total packets sent
- `system_network_packets_recv_total` - Total packets received
- `system_network_errors_in_total` - Total input errors
- `system_network_errors_out_total` - Total output errors
- `system_network_drop_in_total` - Total input drops
- `system_network_drop_out_total` - Total output drops
GPU (NVIDIA):
- `system_gpu_count` - Number of GPUs
- `system_gpu_utilization_percent` - GPU utilization
- `system_gpu_memory_utilization_percent` - GPU memory utilization
- `system_gpu_memory_total_bytes` - Total GPU memory
- `system_gpu_memory_used_bytes` - Used GPU memory
- `system_gpu_memory_free_bytes` - Free GPU memory
- `system_gpu_temperature_celsius` - GPU temperature
- `system_gpu_power_usage_milliwatts` - GPU power usage
- `system_gpu_fan_speed_percent` - Fan speed
- `system_gpu_clock_sm_mhz` - SM clock speed
- `system_gpu_clock_memory_mhz` - Memory clock speed
Application metrics are prefixed with `app_` and include the endpoint name as a label.
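As an illustration of what an application endpoint could look like, the sketch below serves a JSON document in the same shape as the shipper payload shown earlier (name/value/type/labels) on the `app1` endpoint from the example configuration. This is an assumption about the expected schema; check `internal/collector/http.go` for the exact format the scraper parses.

```go
package main

import (
    "encoding/json"
    "log"
    "net/http"
    "time"
)

// appMetric mirrors the JSON payload format shown above; the real schema
// expected by the HTTP collector may differ.
type appMetric struct {
    Name   string            `json:"name"`
    Value  float64           `json:"value"`
    Type   string            `json:"type"`
    Labels map[string]string `json:"labels,omitempty"`
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
    payload := map[string]interface{}{
        "timestamp": time.Now().Unix(),
        "metrics": []appMetric{
            {Name: "app_requests_total", Value: 1234, Type: "counter"},
            {Name: "app_queue_depth", Value: 7, Type: "gauge", Labels: map[string]string{"queue": "default"}},
        },
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(payload)
}

func main() {
    http.HandleFunc("/metrics", metricsHandler)
    // Matches the "app1" endpoint (http://localhost:3000/metrics) in config.example.json.
    log.Fatal(http.ListenAndServe(":3000", nil))
}
```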
Protect sensitive configuration and certificate files:
# Configuration file
chmod 600 /opt/metricsd/config.json
chown metricsd:metricsd /opt/metricsd/config.json
# Certificate directory
chmod 700 /etc/metricsd/certs
chown -R metricsd:metricsd /etc/metricsd/certs
# Private keys
chmod 600 /etc/metricsd/certs/*.key
# Certificates
chmod 644 /etc/metricsd/certs/*.crt
Always run the service as a dedicated non-privileged user:
# Create dedicated user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd
# Set ownership
sudo chown -R metricsd:metricsd /opt/metricsd
- Use TLS for all remote communications
- Enable mTLS when possible for mutual authentication
- Restrict network access using firewalls
- Use internal/private networks when available
- Regularly update certificates before expiration
- Store sensitive values in environment variables
- Use secrets management tools (HashiCorp Vault, AWS Secrets Manager, etc.)
- Rotate credentials regularly
- Audit configuration changes
- Enable detailed logging for security monitoring
Create /etc/systemd/system/metricsd.service:
[Unit]
Description=Metrics Collector Service (metricsd)
Documentation=https://github.com/0x524A/metricsd
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=metricsd
Group=metricsd
WorkingDirectory=/opt/metricsd
ExecStart=/opt/metricsd/bin/metricsd -config /opt/metricsd/config.json -log-level info
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10
KillMode=process
TimeoutStopSec=30
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/opt/metricsd
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
# Resource limits
LimitNOFILE=65536
LimitNPROC=512
[Install]
WantedBy=multi-user.target
Install and enable:
# Copy binary and config
sudo mkdir -p /opt/metricsd/{bin,certs}
sudo cp bin/metricsd /opt/metricsd/bin/
sudo cp config.json /opt/metricsd/
# Create user
sudo useradd -r -s /bin/false -d /opt/metricsd metricsd
# Set permissions
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 755 /opt/metricsd/bin/metricsd
# Install and start service
sudo cp metricsd.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable metricsd
sudo systemctl start metricsd
# Check status
sudo systemctl status metricsd
sudo journalctl -u metricsd -f
Prerequisites:
- Docker installed (version 20.10+ recommended)
- Docker Compose (optional, for easier deployment)
- At least 500MB free disk space for the image
Step 1: Create the Dockerfile
Create a file named Dockerfile in the project root:
FROM golang:1.24-bookworm AS builder
# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
git \
make \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Build with all features including GPU support (NVML)
RUN go build -ldflags '-w -s' -o metricsd cmd/metricsd/main.go
FROM debian:bookworm-slim
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
tzdata \
wget \
&& rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -g 1000 metricsd && \
useradd -r -u 1000 -g metricsd -s /bin/false metricsd
# Create directories
RUN mkdir -p /etc/metricsd/certs /var/lib/metricsd
RUN chown -R metricsd:metricsd /etc/metricsd /var/lib/metricsd
WORKDIR /home/metricsd
# Copy binary
COPY --from=builder /app/metricsd /usr/local/bin/metricsd
RUN chmod +x /usr/local/bin/metricsd
# Switch to non-root user
USER metricsd
# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
EXPOSE 8080
ENTRYPOINT ["/usr/local/bin/metricsd"]
CMD ["-config", "/etc/metricsd/config.json"]
Step 2: Build the Image
# Basic build
docker build -t metricsd:latest .
# Build with custom tag
docker build -t metricsd:v1.0.0 .
# Build with specific platform (for cross-platform)
docker build --platform linux/amd64 -t metricsd:latest .
# Build with build arguments (requires a corresponding ARG in the Dockerfile)
docker build --build-arg GO_VERSION=1.24 -t metricsd:latest .
# Build with no cache (clean build)
docker build --no-cache -t metricsd:latest .
# Build and show build progress
docker build --progress=plain -t metricsd:latest .
Step 3: Verify the Build
# List the image
docker images | grep metricsd
# Check image size
docker images metricsd:latest --format "{{.Size}}"
# Inspect the image
docker inspect metricsd:latest
# Test run (quick check)
docker run --rm metricsd:latest -help
Step 4: Tag for Registry (Optional)
# Tag for Docker Hub
docker tag metricsd:latest 0x524A/metricsd:latest
docker tag metricsd:latest 0x524A/metricsd:v1.0.0
# Tag for private registry
docker tag metricsd:latest registry.example.com/metricsd:latest
# Push to registry
docker push 0x524A/metricsd:latest
Optimizing the Build
Create a .dockerignore file to exclude unnecessary files:
# .dockerignore
.git
.gitignore
.github
README.md
LICENSE
*.md
.vscode
.idea
bin/
*.log
*.tmp
.env
.DS_Store
Makefile
docker-compose.yml
Build Troubleshooting
Common build issues:
# Issue: "cannot find package"
# Solution: Ensure go.mod and go.sum are present
go mod tidy
docker build -t metricsd:latest .
# Issue: "no space left on device"
# Solution: Clean up Docker
docker system prune -a --volumes
# Issue: Build is slow
# Solution: Use BuildKit (faster builds)
DOCKER_BUILDKIT=1 docker build -t metricsd:latest .
# Issue: Platform mismatch (M1 Mac, ARM)
# Solution: Build for specific platform
docker build --platform linux/amd64 -t metricsd:latest .
# Issue: Can't connect to Docker daemon
# Solution: Start Docker or check permissions
sudo systemctl start docker # Linux
sudo usermod -aG docker $USER # Add user to docker group
docker-compose.yml (for container metrics):
version: '3.8'
services:
metricsd:
build: .
image: metricsd:latest
container_name: metricsd
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
- MC_LOG_LEVEL=info
- MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
- MC_TLS_ENABLED=true
- MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
- MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
- MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
networks:
- metrics
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
networks:
metrics:
driver: bridge
docker-compose.yml (for HOST metrics - recommended for production):
version: '3.8'
services:
metricsd:
build: .
image: metricsd:latest
container_name: metricsd
restart: unless-stopped
# Use host network to access host metrics
network_mode: host
# Use host PID namespace to see host processes
pid: host
volumes:
# Mount host filesystems for accurate host metrics
- /:/rootfs:ro
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
# Tell gopsutil to use host filesystems
- HOST_PROC=/host/proc
- HOST_SYS=/host/sys
- HOST_ROOT=/rootfs
- MC_LOG_LEVEL=info
- MC_SHIPPER_ENDPOINT=https://prometheus:9090/api/v1/write
- MC_TLS_ENABLED=true
- MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt
- MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key
- MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt
healthcheck:
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:8080/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 10s
# Privileged mode may be needed for full system access
# privileged: true
# Or use specific capabilities
cap_add:
- SYS_PTRACE
- SYS_ADMIN
Prerequisites:
- Built Docker image (see steps above)
- `config.json` file prepared
- TLS certificates (optional, if using TLS)
Option 1: Quick Start (Container Metrics)
# Prepare configuration
cp config.example.json config.json
# Edit config.json with your settings
# Run container
docker run -d \
--name metricsd \
-p 8080:8080 \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-e MC_LOG_LEVEL=info \
metricsd:latest
# Check if it's running
docker ps | grep metricsd
# View logs
docker logs -f metricsd
# Check health
curl http://localhost:8080/health
Option 2: With TLS (Secure)
# Ensure you have certificates
ls -la certs/
# Should have: client.crt, client.key, ca.crt
# Run with TLS
docker run -d \
--name metricsd \
-p 8080:8080 \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-v $(pwd)/certs:/etc/metricsd/certs:ro \
-e MC_LOG_LEVEL=info \
-e MC_TLS_ENABLED=true \
-e MC_TLS_CERT_FILE=/etc/metricsd/certs/client.crt \
-e MC_TLS_KEY_FILE=/etc/metricsd/certs/client.key \
-e MC_TLS_CA_FILE=/etc/metricsd/certs/ca.crt \
metricsd:latest
Option 3: Host Metrics Collection (Recommended for Production)
This mounts host filesystems to collect actual host metrics instead of container metrics:
docker run -d \
--name metricsd-host \
--pid=host \
--network=host \
--restart=unless-stopped \
-v /:/rootfs:ro \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-v $(pwd)/certs:/etc/metricsd/certs:ro \
-e HOST_PROC=/host/proc \
-e HOST_SYS=/host/sys \
-e HOST_ROOT=/rootfs \
-e MC_LOG_LEVEL=info \
metricsd:latest
Option 4: Using Docker Compose (Easiest)
# Build and start
docker-compose up -d
# View logs
docker-compose logs -f metricsd
# Stop
docker-compose down
# Rebuild and restart
docker-compose up -d --build
# View service status
docker-compose ps
Container Management:
# Stop container
docker stop metricsd
# Start container
docker start metricsd
# Restart container
docker restart metricsd
# Remove container
docker rm -f metricsd
# View logs (last 100 lines)
docker logs --tail 100 metricsd
# Follow logs in real-time
docker logs -f metricsd
# Check container health status
docker inspect --format='{{.State.Health.Status}}' metricsd
# Execute command in container
docker exec -it metricsd sh
# View container resource usage
docker stats metricsd
# Export container logs to file
docker logs metricsd > metricsd.log 2>&1
Note: The Deployment below collects pod/container metrics. To collect node/host metrics in Kubernetes, use a DaemonSet instead. See the "Collecting Host Metrics from Docker Container" section for a DaemonSet example.
deployment.yaml (for pod metrics):
apiVersion: v1
kind: Namespace
metadata:
name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
name: metricsd-config
namespace: monitoring
data:
config.json: |
{
"server": {
"host": "0.0.0.0",
"port": 8080
},
"collector": {
"interval_seconds": 60,
"enable_cpu": true,
"enable_memory": true,
"enable_disk": true,
"enable_network": true,
"enable_gpu": false
},
"shipper": {
"type": "prometheus_remote_write",
"endpoint": "https://prometheus.monitoring.svc.cluster.local:9090/api/v1/write",
"timeout": 30000000000,
"tls": {
"enabled": true,
"cert_file": "/etc/metricsd/certs/tls.crt",
"key_file": "/etc/metricsd/certs/tls.key",
"ca_file": "/etc/metricsd/certs/ca.crt",
"server_name": "prometheus.monitoring.svc.cluster.local",
"min_version": "TLS1.2"
}
},
"endpoints": []
}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: metricsd
namespace: monitoring
labels:
app: metricsd
spec:
replicas: 1
selector:
matchLabels:
app: metricsd
template:
metadata:
labels:
app: metricsd
spec:
serviceAccountName: metricsd
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: metricsd
image: metricsd:latest
imagePullPolicy: IfNotPresent
args:
- "-config"
- "/etc/metricsd/config.json"
- "-log-level"
- "info"
ports:
- name: http
containerPort: 8080
protocol: TCP
livenessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /health
port: http
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/metricsd
readOnly: true
- name: certs
mountPath: /etc/metricsd/certs
readOnly: true
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
volumes:
- name: config
configMap:
name: metricsd-config
- name: certs
secret:
secretName: metricsd-tls
---
apiVersion: v1
kind: Service
metadata:
name: metricsd
namespace: monitoring
labels:
app: metricsd
spec:
type: ClusterIP
ports:
- port: 8080
targetPort: http
protocol: TCP
name: http
selector:
app: metricsd
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: metricsd
namespace: monitoring
Create TLS secret:
kubectl create secret generic metricsd-tls \
--from-file=tls.crt=certs/client.crt \
--from-file=tls.key=certs/client.key \
--from-file=ca.crt=certs/ca.crt \
-n monitoring
Deploy:
kubectl apply -f deployment.yaml
kubectl get pods -n monitoring
kubectl logs -f -n monitoring deployment/metricsd
By default, a containerized application collects metrics from inside the container (container CPU, container memory, etc.). To collect metrics from the host system instead, you need to mount host filesystems into the container.
- Container metrics: Shows resource usage of the container itself (limited by cgroups)
- Host metrics: Shows actual host machine CPU, memory, disk, and network usage
- Use case: Monitoring the physical/virtual machine where Docker is running (see the gopsutil sketch below)
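The system collector is built on gopsutil (see Acknowledgments), which reads from `/proc` and `/sys` and honors the `HOST_PROC`/`HOST_SYS` overrides described below. A minimal sketch of that behavior, assuming the `github.com/shirou/gopsutil/v3` module:

```go
package main

import (
    "fmt"
    "log"
    "os"
    "time"

    "github.com/shirou/gopsutil/v3/cpu"
    "github.com/shirou/gopsutil/v3/mem"
)

func main() {
    // When HOST_PROC is set (e.g. to /host/proc inside the container),
    // gopsutil reads the mounted host procfs instead of the container's own.
    fmt.Println("HOST_PROC =", os.Getenv("HOST_PROC"))

    percents, err := cpu.Percent(time.Second, true) // per-core utilization
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("cores visible: %d\n", len(percents)) // all host cores when /proc is the host's

    vm, err := mem.VirtualMemory()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("total memory: %d bytes (%.1f%% used)\n", vm.Total, vm.UsedPercent)
}
```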
Mount these host paths into your container:
| Host Path | Container Mount | Purpose |
|---|---|---|
| `/proc` | `/host/proc:ro` | Process information, CPU stats |
| `/sys` | `/host/sys:ro` | System information, block devices |
| `/` | `/rootfs:ro` | Root filesystem for disk metrics |
| `/var/run/docker.sock` | `/var/run/docker.sock:ro` | Docker socket (optional) |
Set these environment variables to tell the gopsutil library to use host paths:
HOST_PROC=/host/proc
HOST_SYS=/host/sys
HOST_ROOT=/rootfs
docker run -d \
--name metricsd-host-metrics \
--pid=host \
--network=host \
--restart=unless-stopped \
-v /:/rootfs:ro \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v /var/run/docker.sock:/var/run/docker.sock:ro \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-e HOST_PROC=/host/proc \
-e HOST_SYS=/host/sys \
-e HOST_ROOT=/rootfs \
-e MC_LOG_LEVEL=info \
metricsd:latest
version: '3.8'
services:
metricsd-host:
image: metricsd:latest
container_name: metricsd-host-metrics
restart: unless-stopped
network_mode: host # Access host network interfaces
pid: host # Access host processes
volumes:
- /:/rootfs:ro
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
- ./config.json:/etc/metricsd/config.json:ro
- ./certs:/etc/metricsd/certs:ro
environment:
- HOST_PROC=/host/proc
- HOST_SYS=/host/sys
- HOST_ROOT=/rootfs
cap_add:
- SYS_PTRACE # For process monitoring
When collecting host metrics:
- ✅ Use read-only mounts (`:ro`) for host filesystems
- ✅ Minimize capabilities - only add what's needed (SYS_PTRACE, SYS_ADMIN)
- ⚠️ Avoid `privileged: true` unless absolutely necessary
- ✅ Run as non-root user when possible
- ✅ Review mounted paths - only mount what you need
For Kubernetes, use a DaemonSet to run one pod per node:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: metricsd-host
namespace: monitoring
spec:
selector:
matchLabels:
app: metricsd-host
template:
metadata:
labels:
app: metricsd-host
spec:
hostNetwork: true
hostPID: true
containers:
- name: metricsd
image: metricsd:latest
env:
- name: HOST_PROC
value: /host/proc
- name: HOST_SYS
value: /host/sys
- name: HOST_ROOT
value: /rootfs
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /rootfs
readOnly: true
- name: config
mountPath: /etc/metricsd
- name: certs
mountPath: /etc/metricsd/certs
securityContext:
capabilities:
add:
- SYS_PTRACE
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
- name: config
configMap:
name: metricsd-config
- name: certs
secret:
secretName: metricsd-tls
Check the logs to ensure host metrics are being collected:
# Check logs
docker logs metricsd-host-metrics
# You should see metrics for ALL host CPUs, not just container limits
# Example: If host has 16 cores, you should see metrics for all 16
# Test with debug logging
docker run --rm -it \
--pid=host \
-v /proc:/host/proc:ro \
-v /sys:/host/sys:ro \
-v $(pwd)/config.json:/etc/metricsd/config.json:ro \
-e HOST_PROC=/host/proc \
-e HOST_SYS=/host/sys \
metricsd:latest -config /etc/metricsd/config.json -log-level debug
Adjust based on your needs:
- High-frequency monitoring: 10-30 seconds
- Standard monitoring: 60 seconds (recommended)
- Low-frequency monitoring: 300+ seconds
- Enable session tickets - Reduces TLS handshake overhead
- Use TLS 1.3 - Faster handshake and better performance
- Connection pooling - Automatically handled by the HTTP client
- Keep-alive - Connections are reused between shipments (see the transport sketch below)
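These behaviors come from Go's standard `net/http` and `crypto/tls` packages rather than anything metricsd-specific. The sketch below shows one way a shipping client could enable keep-alive pooling and TLS session resumption; treat it as illustrative, not the project's actual transport settings.

```go
package main

import (
    "crypto/tls"
    "net/http"
    "time"
)

// newShippingClient returns an HTTP client tuned for repeated shipments:
// idle connections stay open between intervals and TLS sessions can be resumed.
func newShippingClient(tlsCfg *tls.Config) *http.Client {
    if tlsCfg == nil {
        tlsCfg = &tls.Config{MinVersion: tls.VersionTLS12}
    }
    // An LRU session cache enables session ticket / resumption reuse.
    tlsCfg.ClientSessionCache = tls.NewLRUClientSessionCache(64)

    transport := &http.Transport{
        TLSClientConfig:     tlsCfg,
        MaxIdleConns:        10,
        MaxIdleConnsPerHost: 2,
        IdleConnTimeout:     90 * time.Second, // keep-alive window between shipments
    }
    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second, // matches the default shipper.timeout of 30s
    }
}

func main() {
    client := newShippingClient(nil)
    _ = client // use client.Post(...) from the shipping loop
}
```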
Typical resource usage:
- CPU: 50-200m (minimal overhead)
- Memory: 50-150 MB RSS
- Network: Depends on metric volume and shipping frequency
Optimize with:
{
"collector": {
"interval_seconds": 60,
"enable_cpu": true,
"enable_memory": true,
"enable_disk": false,
"enable_network": false,
"enable_gpu": false
}
}
The service exposes its own health endpoint:
- Monitor HTTP response time at `/health`
- Check logs for shipping errors
- Monitor system resource usage
- Set up alerts for service failures
# Clone repository
git clone https://github.com/0x524A/metricsd.git
cd metricsd
# Install dependencies
go mod download
# Build
make build
# Run with development config
./bin/metricsd -config config.json -log-level debug
metricsd/
├── cmd/
│ └── metricsd/ # Main application entry point
│ └── main.go
├── internal/ # Internal packages
│ ├── collector/ # Metric collectors
│ │ ├── collector.go # Collector interface & registry
│ │ ├── system.go # System metrics (CPU, memory, disk, network)
│ │ ├── gpu.go # GPU metrics (NVIDIA NVML)
│ │ └── http.go # HTTP endpoint scraper
│ ├── config/ # Configuration management
│ │ └── config.go # Config structs & validation
│ ├── shipper/ # Metric shipping backends
│ │ ├── shipper.go # Shipper interface
│ │ ├── prometheus.go # Prometheus remote write protocol
│ │ └── http_json.go # HTTP JSON POST
│ ├── orchestrator/ # Collection & shipping coordination
│ │ └── orchestrator.go
│ └── server/ # HTTP server (health checks)
│ └── server.go
├── bin/ # Compiled binaries
├── config.json # Runtime configuration
├── config.example.json # Example configuration
├── Makefile # Build automation
├── go.mod # Go module definition
└── README.md # This file
# Run all tests
go test ./...
# Run with coverage
go test -cover ./...
# Generate coverage report
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out
# Run specific package tests
go test ./internal/collector/...
# Run with verbose output
go test -v ./...
# Run benchmarks
go test -bench=. ./...
# Build for current platform
go build -o bin/metricsd cmd/metricsd/main.go
# Build with optimizations
go build -ldflags="-s -w" -o bin/metricsd cmd/metricsd/main.go
# Build for multiple platforms
GOOS=linux GOARCH=amd64 go build -o bin/metricsd-linux-amd64 cmd/metricsd/main.go
GOOS=darwin GOARCH=amd64 go build -o bin/metricsd-darwin-amd64 cmd/metricsd/main.go
GOOS=windows GOARCH=amd64 go build -o bin/metricsd-windows-amd64.exe cmd/metricsd/main.go
# Using Makefile (if available)
make build
make test
make clean
Follow standard Go conventions:
- Use `gofmt` for formatting
- Use `golint` for linting
- Use `go vet` for static analysis
# Format code
gofmt -w .
# Run linter
golangci-lint run
# Static analysis
go vet ./...
- Create a new collector in `internal/collector/`:
package collector

import "context"

type MyCollector struct {
    // fields
}

func NewMyCollector() *MyCollector {
    return &MyCollector{}
}

func (c *MyCollector) Collect(ctx context.Context) ([]Metric, error) {
    var metrics []Metric
    // Gather values and append Metric entries here.
    return metrics, nil
}

func (c *MyCollector) Name() string {
    return "my_collector"
}
- Register in `cmd/metricsd/main.go`:
myCollector := collector.NewMyCollector()
registry.Register(myCollector)
- Create a new shipper in `internal/shipper/`:
package shipper

import (
    "context"
    "crypto/tls"
    "net/http"

    "github.com/0x524A/metricsd/internal/collector" // adjust to your module path
)

type MyShipper struct {
    endpoint string
    client   *http.Client
}

func NewMyShipper(endpoint string, tlsConfig *tls.Config) (*MyShipper, error) {
    client := &http.Client{Transport: &http.Transport{TLSClientConfig: tlsConfig}}
    return &MyShipper{endpoint: endpoint, client: client}, nil
}

func (s *MyShipper) Ship(ctx context.Context, metrics []collector.Metric) error {
    // Encode and POST the metrics to s.endpoint here.
    return nil
}

func (s *MyShipper) Close() error {
    // Release resources, e.g. s.client.CloseIdleConnections().
    return nil
}
- Add the new shipper type to config validation in `internal/config/config.go`
- Add its initialization in `cmd/metricsd/main.go`
The project adheres to SOLID principles:
- Single Responsibility Principle (SRP)
- Each collector focuses on one metric source
- Each shipper handles one protocol
- Orchestrator only coordinates collection and shipping
- Open/Closed Principle (OCP)
- New collectors can be added without modifying existing code
- New shippers can be plugged in via the interface
- Configuration is extensible
- Liskov Substitution Principle (LSP)
- All collectors implement the `Collector` interface
- All shippers implement the `Shipper` interface
- Components are interchangeable
- Interface Segregation Principle (ISP)
- Small, focused interfaces (`Collector`, `Shipper`)
- Clients depend only on methods they use
- No fat interfaces
- Dependency Inversion Principle (DIP)
- High-level modules depend on abstractions (interfaces)
- Concrete implementations are injected
- Loose coupling throughout the codebase
Service won't start:
# Check logs
sudo journalctl -u metricsd -n 50
# Verify configuration
./bin/metricsd -config config.json # Should show validation errors
# Check file permissions
ls -la /opt/metricsd/config.json
ls -la /etc/metricsd/certs/
TLS handshake errors:
# Test TLS connection
openssl s_client -connect metrics.example.com:443 \
-cert /etc/metricsd/certs/client.crt \
-key /etc/metricsd/certs/client.key \
-CAfile /etc/metricsd/certs/ca.crt
# Verify certificate
openssl x509 -in /etc/metricsd/certs/client.crt -text -noout
# Check certificate expiration
openssl x509 -in /etc/metricsd/certs/client.crt -checkend 0
Metrics not shipping:
- Check network connectivity to endpoint
- Verify TLS configuration
- Check endpoint authentication requirements
- Review logs for error messages
- Test endpoint manually with curl
High memory usage:
- Reduce collection frequency
- Disable unused collectors
- Check for memory leaks in logs
- Monitor with pprof if needed (see the sketch below)
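For the pprof suggestion above, Go's standard `net/http/pprof` package can expose profiling endpoints on a side port. This is a generic sketch, not something metricsd currently ships with:

```go
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose pprof on localhost only; inspect heap usage with:
    //   go tool pprof http://localhost:6060/debug/pprof/heap
    log.Println(http.ListenAndServe("localhost:6060", nil))
}
```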
Permission denied errors:
# Fix ownership
sudo chown -R metricsd:metricsd /opt/metricsd
sudo chown -R metricsd:metricsd /etc/metricsd
# Fix permissions
sudo chmod 600 /opt/metricsd/config.json
sudo chmod 600 /etc/metricsd/certs/*.key
sudo chmod 644 /etc/metricsd/certs/*.crt
Q: Can I use metricsd without TLS?
A: Yes, set shipper.tls.enabled to false. However, TLS is strongly recommended for production.
Q: Does metricsd support custom metrics?
A: Yes, add application endpoints to the endpoints array in the configuration. The HTTP collector will scrape them.
Q: How do I rotate TLS certificates? A: Update the certificate files, then restart the service. Consider implementing a certificate rotation process with minimal downtime.
Q: Can I ship to multiple endpoints? A: Currently, one shipper endpoint is supported per instance. Run multiple instances for multiple destinations.
Q: What's the performance impact? A: Minimal. Typical CPU usage is <1% and memory usage is around 50-150MB depending on enabled collectors.
Q: How do I monitor metricsd itself?
A: Use the /health endpoint and monitor the service logs. You can also use process monitoring tools.
Q: Does it work on Windows? A: Yes, but some system metrics may have limited support. GPU metrics require NVIDIA drivers.
Q: Can I use this with Grafana? A: Yes, ship metrics to Prometheus (using remote write) and configure Grafana to query Prometheus.
Q: How do I debug TLS issues?
A: Enable debug logging with -log-level debug and review the detailed TLS handshake logs.
Q: Is IPv6 supported? A: Yes, both IPv4 and IPv6 are supported for all network operations.
Q: How do I collect host metrics when running in Docker?
A: Mount the host's /proc, /sys, and / into the container and set environment variables. See the "Collecting Host Metrics from Docker Container" section for complete instructions.
Q: Why are my CPU/memory metrics showing container limits instead of host resources?
A: Without host filesystem mounts, the container only sees its own cgroup limits. Mount host paths and set HOST_PROC=/host/proc and HOST_SYS=/host/sys to collect host metrics.
- Add support for multiple shipper endpoints
- Implement metric aggregation and buffering
- Add support for metric filtering and transformation
- Implement retry logic with exponential backoff
- Add support for custom labels on system metrics
- Implement metric caching for offline scenarios
- Add Datadog, InfluxDB, and other shipper backends
- Add web UI for configuration and monitoring
- Implement metric sampling for high-volume scenarios
- Add support for Windows-specific metrics
- Implement health check with detailed status information
MIT License - see LICENSE file for details
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Add tests for new functionality
- Ensure tests pass (`go test ./...`)
- Format your code (`gofmt -w .`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Follow Go best practices and idioms
- Maintain SOLID design principles
- Add tests for new functionality
- Update documentation as needed
- Keep commits atomic and well-described
- Ensure backward compatibility when possible
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: This README and inline code comments
When reporting bugs, please include:
- metricsd version
- Operating system and version
- Go version
- Configuration file (sanitized)
- Relevant log output
- Steps to reproduce
Feature requests are welcome! Please:
- Check existing issues first
- Provide detailed use case
- Explain expected behavior
- Consider contributing the feature
Built with:
- zerolog - Fast structured logging
- gopsutil - System metrics collection
- prometheus/client_golang - Prometheus integration
- NVML - GPU metrics
- Your Name - Initial work
See also the list of contributors who participated in this project.
Made with ❤️ by the metricsd team