Skip to content

auxacc1/linux-recognition

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Linux Recognition

Project Overview

Linux Recognition is an asynchronous backend service that identifies Linux source packages from software fingerprints. The system normalizes input fingerprints and employs distro-specific logic to resolve package origins. It fetches detailed metadata (source package name, description, licenses, homepage) and enriches it using public repository APIs, obtaining additional information (author/publisher, release date, CPE string).

flowchart TD
    A[Raw Fingerprints] --> B[Normalize Fingerprints]
    B --> C{Check Recognized DB}
    C -->|Exists| D[Skip Processing]
    C -->|New| E[Process Batch]
    E --> F[Recognize Packages]
    F --> G{Is Library Package?}
    G --> |Other| I[Use Distro Class]
    I --> |Fail| J[Use Universal Lookup]
    G --> |library| K[Enrich Via Public Project API]
    J --> K
    K --> L[Resolve Properties]
    L --> M[Generate AI Description]
    L --> N[SPDX License Matching]
    L --> O[Resolve CPE]
    E --> P[Update Recognized DB]
Loading

Features

Fingerprint Processing

  • Input: List of dictionaries with software, publisher, and version fields (e.g., from SCCM inventory)
  • Normalization: Standardizes software names and versions

Distro-Specific Package Recognition

  • Supported Distros: Debian, Ubuntu, Fedora, OpenSUSE, Arch, Alpine and others
  • Two-Stage Lookup:
    • UniversalPackage: Uses Repology database dump
    • Fallback to Native APIs (e.g., Debian/Ubuntu → Ultimate Debian Database).

Metadata Enrichment

  • Supported Repos: GitHub, GitLab, Pagure, MetaCPAN, SourceForge, PyPI, RubyGems
  • Process:
    • Updates initial metadata (description, license) from repo APIs
    • Extracts release dates by matching version to tags/releases using custom heuristics
    • Parses unstructured date formats using generalized date parser

AI-Powered Features

  • Package description: Uses OpenAI to generate 2-sentence formal description from raw text

  • SPDX License Normalization:

    • Creates Faiss embeddings of SPDX license database:
    • Matches licenses via:
      • Exact database lookup
      • Similarity search (threshold-based)
      • Fallback to original license + DB update

Miscellaneous

  • JSON Logging: extensive structured JSON logs for detailed traceability and analysis
  • Configuration Validation: Pydantic models for robust configuration

Strengths

  • Provenance Resolution: Accurately traces packages across multiple Linux distros despite naming inconsistencies
  • License Compliance: Automates SPDX license matching using AI similarity thresholds
  • Handles real-world data variance through:
    • Flexible version matching
    • Robust date parsing
    • Multi-source metadata reconciliation

Tech Stack

  • Fully asynchronous workflow (aiohttp + asyncpg)
  • PostgreSQL storage
  • FAISS + Langchain
  • pytest (test coverage)

Prerequisites

Installation & Setup

# 1. Install from source (recommended)
git clone https://github.com/auxacc1/linux-recognition
cd linux-recognition
pip install .  # Installs package and dependencies from pyproject.toml

# 2. Initialize project (one-time setup)
linux_recognition_initialize  # creation + rebuild of databases and data download

Usage

import asyncio

from linux_recognition.recognition import recognize
from linux_recognition.typestore.datatypes import FingerprintDict


raw_fingerprints: list[FingerprintDict] = [
    {
    'software': 'yast2-ycp-ui-bindings',
    'publisher': 'SUSE LLC <https://www.suse.com/>',
    'version': '4.3.9'
    },
    {
    'software': 'nodejs010-nodejs-osenv',
    'publisher': 'Red Hat, Inc.',
    'version': '0.1.3-1.el7'
    },
]


asyncio.run(recognize(raw_fingerprints))

Configuration

Use config file to customize database connections, logging behavior, and AI model selection.

Required Credentials

Set these as environment variables with the prefix LINUX_RECOGNITION__ (e.g., LINUX_RECOGNITION__OPENAI_API_KEY):

Variable Second Header
DATABASE__POSTGRES_DEFAULT__PASSWORD Password for local PostgreSQL databases
PYPI_USER_AGENT (Non-secret) User-agent string for PyPI API requests
SOURCEFORGE_BEARER Bearer token for SourceForge API access
OBS_USERNAME & OBS_PASSWORD Credentials for Open Build Service (OBS)
GITHUB_TOKEN GitHub Personal Access Token for API access
OPENAI_API_KEY API key for OpenAI services

License

MIT License

More info soon