A Python tool that identifies package coordinates (name, version, license, PURL) from source code directories using a hybrid discovery strategy with manifest parsing, code fingerprinting, repository search, and Software Heritage archive integration.
- 2-Phase Discovery Strategy: Hash-based identification enhanced by manifest parsing
- Multi-Ecosystem Support: PyPI, npm, Maven, Go, RubyGems, PHP, .NET, and more
- Fast Performance: 5-15 seconds for typical projects (vs 90+ seconds with SWH)
- SEMCL.ONE Integration: Works seamlessly with upmex, osslili, and other ecosystem tools
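To make the hash-based side of discovery concrete, here is a minimal sketch of how a Software Heritage identifier (SWHID) for a single file's content can be computed. It relies on the fact that a content SWHID reuses the Git blob hash. The function name `content_swhid` is hypothetical and is not part of src2purl's API; directory identifiers, which the tool works with, are computed over tree entries and are not covered here.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the version 1 content SWHID for raw file bytes.

    Hypothetical helper for illustration; not part of src2purl.
    """
    # Git blob hashing: SHA-1 over the header "blob <length>\0" plus the bytes.
    header = b"blob %d\x00" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # The empty file has a well-known Git blob hash.
    print(content_swhid(b""))
```

Because the identifier depends only on the bytes, two copies of the same file produce the same SWHID regardless of file name or location, which is what makes archive lookups by hash possible.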
pip install src2purl

For development:
git clone https://github.com/SemClone/src2purl.git
cd src2purl
pip install -e .

# Identify package from source code
src2purl /path/to/source/code
# With Software Heritage archive (comprehensive but slower)
src2purl /path/to/source --use-swh

# Fast discovery (default) - uses manifest parsing + SCANOSS + GitHub (5-15 seconds)
src2purl /path/to/source/code
# Comprehensive discovery - Includes Software Heritage archive (90+ seconds)
src2purl /path/to/source --use-swh
# High confidence matches only
src2purl /path/to/source --confidence-threshold 0.85
# JSON output format for integration
src2purl /path/to/source --output-format json
# Detect subcomponents in monorepos
src2purl /path/to/source --detect-subcomponents

from src2purl import identify_package
# Basic identification
result = identify_package("/path/to/source")
print(f"Package: {result.name}@{result.version}")
print(f"PURL: {result.purl}")
# With options
result = identify_package(
    path="/path/to/source",
    use_swh=True,
    confidence_threshold=0.85,
)

# Default: 2-Phase hybrid approach (5-15 seconds)
# Phase 1: Hash-based discovery + Phase 2: UPMEX manifest parsing
# Good for: Most use cases, balanced speed and accuracy
src2purl /path/to/project
# Comprehensive: Include Software Heritage archive (90+ seconds)
# Phase 1 includes deep provenance discovery + Phase 2: UPMEX enhancement
# Good for: Security audits, research, historical analysis
export SWH_API_TOKEN=your_token # Optional but recommended
src2purl /path/to/project --use-swh
# High confidence: Only report confident matches
# Filters results to highest confidence findings from both phases
src2purl /path/to/project --confidence-threshold 0.85

No API keys required! The tool works with the free public APIs. API keys only provide enhanced rate limits and additional features.
1. GitHub API - Most Valuable (Free, 2 minutes to set up)
export GITHUB_TOKEN=your_github_personal_access_token

- Get token: https://github.com/settings/tokens (no special permissions needed)
- Benefits:
  - Rate limit: 60 → 5,000 requests/hour (unauthenticated vs. authenticated REST API requests)
  - Better search: More accurate repository identification
  - No cost: Completely free
- Impact: Significant improvement for repository discovery
2. SCANOSS API - Nice to Have (Free, optional)
export SCANOSS_API_KEY=your_scanoss_key

- Get token: Register at https://www.scanoss.com
- Benefits:
  - No cost: Free tier available
  - Enhanced rate limits: Premium API endpoint
  - Additional features: Possible extra metadata
- Impact: Minor improvement (SCANOSS works well without a key)
3. Software Heritage API - For Heavy Usage (Free, only if using --use-swh)
export SWH_API_TOKEN=your_swh_token

- Get token: Register at https://archive.softwareheritage.org/api/
- Benefits:
  - Bypass rate limits: No 60-second waits
  - Faster comprehensive scans when using `--use-swh`
- Impact: Essential for the `--use-swh` flag; not needed for the default fast mode
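To see what a GitHub token changes in practice, the current limits can be read from GitHub's `/rate_limit` REST endpoint. The sketch below is independent of src2purl: the helper names are hypothetical, and it only builds the request and summarizes a response payload, so nothing is sent until `fetch_rate_limit` is called.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def build_rate_limit_request(token=None):
    """Build a request for GitHub's rate limit endpoint, with optional auth."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests are counted against the higher per-user limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(RATE_LIMIT_URL, headers=headers)


def summarize(payload):
    """Summarize the core REST API quota from a /rate_limit response body."""
    core = payload["resources"]["core"]
    return f"{core['remaining']}/{core['limit']} core requests remaining"


def fetch_rate_limit(token=None):
    """Perform the HTTP call and return the decoded JSON payload."""
    request = build_rate_limit_request(token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `summarize(fetch_rate_limit(os.environ.get("GITHUB_TOKEN")))` once with and once without the variable set should show the difference in the reported limit.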
| Configuration | Typical Time | API Calls | Best For |
|---|---|---|---|
| No API keys | 5-15 seconds | Minimal | Most users |
| + GitHub token | 5-15 seconds | Enhanced | Recommended setup |
| + All tokens | 5-15 seconds | Premium | Production use |
| + SWH mode | 90+ seconds | Heavy | Security audits |
Recommendation: Start with a GitHub token only - it's free, quick to set up, and provides the biggest improvement.
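Before running a scan, it can help to check which of the optional tokens are actually present in the environment. The following is a small sketch using only the three variable names documented here; the `token_report` helper and the benefit descriptions are illustrative and not part of src2purl.

```python
import os

# Variable names come from this README; the descriptions are paraphrased.
OPTIONAL_TOKENS = {
    "GITHUB_TOKEN": "higher GitHub rate limits, better repository search",
    "SCANOSS_API_KEY": "enhanced SCANOSS rate limits",
    "SWH_API_TOKEN": "avoids rate-limit waits when using --use-swh",
}


def token_report(env=None):
    """Return one status line per optional token (hypothetical helper)."""
    env = os.environ if env is None else env
    lines = []
    for name, benefit in OPTIONAL_TOKENS.items():
        # An empty string counts as "not set".
        status = "set" if env.get(name) else "not set"
        lines.append(f"{name}: {status} ({benefit})")
    return lines


if __name__ == "__main__":
    for line in token_report():
        print(line)
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests or CI without touching real credentials.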
# Generate and validate SWHID for a directory
src2purl-validate /path/to/directory
# Compare against expected SWHID
src2purl-validate /path/to/directory --expected-swhid swh:1:dir:abc123...
# Use fallback implementation
src2purl-validate /path/to/directory --use-fallback --verbose

- `path`: Directory path to analyze (required)
- `--confidence-threshold`: Minimum confidence to report matches (default: 0.3)
- `--output-format`: Output format: 'json' or 'table' (default: table)
- `--verbose`: Verbose output for debugging

- `--use-swh`: Include Software Heritage archive checking (optional, adds 90+ seconds)
- `--no-license-detection`: Skip automatic license detection from local source (faster)
- `--detect-subcomponents`: Detect and identify subcomponents in monorepos
- `--max-depth`: Maximum directory depth to scan (default: 2)

- `--no-cache`: Disable API response caching
- `--clear-cache`: Clear all cached API responses and exit

- `--api-token`: Software Heritage API token (only used with `--use-swh`)
- Environment variables: `GITHUB_TOKEN`, `SCANOSS_API_KEY`, `SWH_API_TOKEN`
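When calling the CLI from another program, the options above can be assembled into an argument list and the JSON output parsed. This is a rough sketch: the flag names are taken from the list above, while `build_command` and `run_json` are hypothetical helpers. It also assumes that `--output-format json` writes a single JSON document to stdout, which is not specified here, so the shape of the parsed result should be checked against real output.

```python
import json
import subprocess


def build_command(path, *, use_swh=False, confidence_threshold=None,
                  output_format=None, detect_subcomponents=False,
                  max_depth=None, no_cache=False):
    """Translate keyword options into a src2purl argument list."""
    cmd = ["src2purl", str(path)]
    if use_swh:
        cmd.append("--use-swh")
    if confidence_threshold is not None:
        cmd += ["--confidence-threshold", str(confidence_threshold)]
    if output_format is not None:
        cmd += ["--output-format", output_format]
    if detect_subcomponents:
        cmd.append("--detect-subcomponents")
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if no_cache:
        cmd.append("--no-cache")
    return cmd


def run_json(path, **options):
    """Run src2purl and parse stdout as JSON (assumed output behavior)."""
    cmd = build_command(path, output_format="json", **options)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.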
# Default: 2-Phase Discovery (Hash-based + UPMEX manifest parsing)
src2purl /path/to/project
# Phase 1: SWHID generation → GitHub/SCANOSS search
# Phase 2: UPMEX manifest extraction → Cross-validation & enhancement
# Add Software Heritage for comprehensive provenance discovery
src2purl /path/to/project --use-swh
# Phase 1: SWHID generation → Software Heritage archive → Fallback search
# Phase 2: UPMEX manifest extraction → Cross-validation & enhancement
# Performance optimization options
src2purl /path/to/project --no-license-detection # Skip license enhancement
src2purl /path/to/project --use-swh --verbose # Full discovery with details

SRC2PURL is a core component of the SEMCL.ONE ecosystem:
- Uses upmex for manifest parsing and package metadata extraction
- Integrates with osslili for enhanced license detection
- Provides PURLs for purl2src to download source packages
- Supports purl2notices for generating attribution documentation
- Enables ospac policy evaluation with identified packages
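The PURL is what the downstream tools listed above consume, so it helps to see how package coordinates map onto the Package URL format (`pkg:type/namespace/name@version`). The sketch below is a simplified illustration, not src2purl's own implementation: `build_purl` is a hypothetical helper that ignores qualifiers, subpaths, and the type-specific name normalization rules defined by the PURL specification.

```python
from urllib.parse import quote


def build_purl(pkg_type, name, version=None, namespace=None):
    """Assemble a basic Package URL from its coordinates (simplified)."""
    parts = [pkg_type.lower()]
    if namespace:
        # Each namespace segment is percent-encoded separately.
        parts.extend(quote(seg, safe="") for seg in namespace.split("/") if seg)
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


if __name__ == "__main__":
    print(build_purl("pypi", "requests", "2.31.0"))
    print(build_purl("npm", "core", "16.0.0", namespace="@angular"))
```

For production use, a library that implements the full specification is a safer choice than hand-built strings.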
Configuration via environment variables:
# API tokens (optional but recommended)
export GITHUB_TOKEN=your_github_token
export SCANOSS_API_KEY=your_scanoss_key
export SWH_API_TOKEN=your_swh_token
# Performance settings
export SRC2PURL_CACHE_DIR=~/.cache/src2purl
export SRC2PURL_MAX_DEPTH=2

- User Guide - Comprehensive usage examples
- API Reference - Python API documentation
- Discovery Methods - Detailed explanation of identification strategies
- Examples - Common use cases and workflows
We welcome contributions! Please see CONTRIBUTING.md for details on:
- Code of conduct
- Development setup
- Submitting pull requests
- Reporting issues
For support and questions:
- GitHub Issues - Bug reports and feature requests
- Documentation - Complete project documentation
- SEMCL.ONE Community - Ecosystem support and discussions
GNU Affero General Public License v3.0 (AGPL-3.0) - see LICENSE file for details.
See AUTHORS.md for a list of contributors.
Part of the SEMCL.ONE ecosystem for comprehensive OSS compliance and code analysis.