Convert security data from 15 sources into Subject-Predicate-Object (SPO) knowledge-graph triples in Parquet format.
Sources: ATT&CK · CAPEC · CWE · CVE · CPE · D3FEND · ATLAS · CAR · ENGAGE · EPSS · KEV · Vulnrichment · GHSA · Sigma · ExploitDB
---
config:
  layout: dagre
  theme: neo
---
flowchart LR
STIX["ATT&CK STIX JSON"]:::src --> CONV["convert.py"]:::conv
CXML["CAPEC XML"]:::src --> CONV
WXML["CWE XML"]:::src --> CONV
CVEJ["CVE JSON 5.x"]:::src --> CONV
CPEJ["CPE JSON"]:::src --> CONV
D3FJ["D3FEND JSON-LD"]:::src --> CONV
ATLY["ATLAS YAML"]:::src --> CONV
CARY["CAR YAML"]:::src --> CONV
ENGJ["ENGAGE JSON"]:::src --> CONV
EPSC["EPSS CSV"]:::src --> CONV
KEVJ["KEV JSON"]:::src --> CONV
VULJ["Vulnrichment JSON"]:::src --> CONV
GHSJ["GHSA JSON"]:::src --> CONV
SIGY["Sigma YAML"]:::src --> CONV
EDBC["ExploitDB CSV"]:::src --> CONV
CONV --> ATK["enterprise / mobile / ics / attack-all"]:::out --> CMB["combined.parquet"]:::conv
CONV --> CAP["capec"]:::out --> CMB
CONV --> CW["cwe"]:::out --> CMB
CONV --> CVE["cve"]:::out --> CMB
CONV --> CPE["cpe"]:::out --> CMB
CONV --> D3F["d3fend"]:::out --> CMB
CONV --> ATL["atlas"]:::out --> CMB
CONV --> CAR["car"]:::out --> CMB
CONV --> ENG["engage"]:::out --> CMB
CONV --> EPS["epss"]:::out --> CMB
CONV --> KEV["kev"]:::out --> CMB
CONV --> VUL["vulnrichment"]:::out --> CMB
CONV --> GHS["ghsa"]:::out --> CMB
CONV --> SIG["sigma"]:::out --> CMB
CONV --> EDB["exploitdb"]:::out --> CMB
CMB --> HF["HuggingFace Hub"]:::hf
classDef src fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef conv fill:#f3f4f6,stroke:#6b7280,color:#374151
classDef out fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef hf fill:#d1fae5,stroke:#10b981,color:#064e3b
---
config:
  layout: dagre
  theme: neo
---
graph LR
%% ATT&CK core
C[Campaign]:::attack -->|attributed-to| G[Group]:::attack
C -->|uses| T[Technique]:::attack
G -->|uses| T
G -->|uses| SW[Malware / Tool]:::attack
SW -->|uses| T
ST[Sub-technique]:::attack -->|subtechnique-of| T
T -->|belongs-to-tactic| TAC[Tactic]:::attack
MIT[Mitigation]:::attack -->|mitigates| T
DC[DataComponent]:::attack -->|detects| T
%% Defense & detection → Technique
DT[DefensiveTechnique]:::d3fend -->|counters| T
AN[Analytic]:::car -->|detects-technique| T
AN -->|maps-to-d3fend| DT
EA[EngagementActivity]:::engage -->|engages-technique| T
AT[ATLAS Technique]:::atlas -->|related-attack-technique| T
%% CAPEC ↔ CWE bridge
AP[Attack Pattern]:::capec -->|maps-to-technique| T
AP -->|related-weakness| W[Weakness]:::cwe
W -->|related-attack-pattern| AP
%% Vulnerability chain
V[Vulnerability]:::cve -->|related-weakness| W
V -->|affects-cpe| P[Platform]:::cpe
V -.->|epss-score| ES((EPSS)):::epss
V -.->|kev| KE((KEV)):::kev
classDef attack fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
classDef capec fill:#fef3c7,stroke:#f59e0b,color:#78350f
classDef cwe fill:#fce7f3,stroke:#ec4899,color:#831843
classDef cve fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
classDef cpe fill:#e0e7ff,stroke:#6366f1,color:#312e81
classDef d3fend fill:#d1fae5,stroke:#10b981,color:#064e3b
classDef car fill:#fef9c3,stroke:#eab308,color:#713f12
classDef engage fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
classDef atlas fill:#cffafe,stroke:#06b6d4,color:#164e63
classDef epss fill:#f3f4f6,stroke:#6b7280,color:#374151
classDef kev fill:#f3f4f6,stroke:#6b7280,color:#374151
Legend: Blue = ATT&CK · Amber = CAPEC · Pink = CWE · Red = CVE · Indigo = CPE · Green = D3FEND · Cyan = ATLAS · Yellow = CAR · Violet = ENGAGE · Gray = EPSS / KEV
# Install dependencies
pip install -r requirements.txt
# Convert everything (all 15 sources) and produce combined.parquet
python src/convert.py
# Convert only ATT&CK
python src/convert.py --sources attack
# Convert a single ATT&CK domain
python src/convert.py --sources attack --domains enterprise
# Convert only CAPEC and CWE (skip others)
python src/convert.py --sources capec cwe
# Convert CVE, EPSS, and KEV together
python src/convert.py --sources cve epss kev
# Skip combined.parquet generation
python src/convert.py --no-combined
# Run individual converters standalone
python src/convert_attack.py
python src/convert_capec.py
python src/convert_cve.py
python src/convert_kev.py
# Use Parquet v1 format for backward compatibility (default is v2)
python src/convert.py --parquet-format v1

Source files are cached in source/ by default. Files are versioned using Last-Modified or ETag headers and are re-downloaded only when the source has been updated. Sources that don't provide version headers are always re-downloaded.
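The caching rule above can be sketched as a small decision function. This is a minimal illustration, not the project's actual code: the function name `needs_download`, the header-dictionary inputs, and the choice to check ETag before Last-Modified are all assumptions made for the example. Header names are assumed to be normalized to the casing shown.

```python
from typing import Mapping, Optional


def needs_download(cached: Optional[Mapping[str, str]],
                   remote: Mapping[str, str]) -> bool:
    """Return True when the cached copy should be refreshed.

    `cached` holds the version headers saved with the previous download
    (None if nothing is cached yet); `remote` holds the headers from a
    fresh response. Both are hypothetical inputs for this sketch.
    """
    if cached is None:
        return True  # nothing cached yet
    # Assumption: prefer ETag, fall back to Last-Modified.
    for header in ("ETag", "Last-Modified"):
        if header in remote:
            return cached.get(header) != remote[header]
    # The source provides no version headers: always re-download.
    return True
```

For example, `needs_download({"ETag": '"abc"'}, {"ETag": '"abc"'})` returns `False`, while an empty `remote` mapping always yields `True`.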
Output goes to output/:
| File | Source | Est. Triples |
|---|---|---|
| enterprise.parquet | ATT&CK Enterprise | ~42K |
| mobile.parquet | ATT&CK Mobile | ~5K |
| ics.parquet | ATT&CK ICS | ~4K |
| attack-all.parquet | ATT&CK combined (deduplicated) | ~50K |
| capec.parquet | CAPEC attack patterns | ~8K |
| cwe.parquet | CWE weaknesses | ~15K |
| cve.parquet | CVE vulnerabilities | ~1.5-3M |
| cpe.parquet | CPE platform enumeration | ~2-4M |
| d3fend.parquet | D3FEND defensive techniques | ~3K |
| atlas.parquet | ATLAS AI/ML techniques | ~3K |
| car.parquet | CAR analytics | ~2K |
| engage.parquet | ENGAGE adversary engagement | ~2K |
| epss.parquet | EPSS exploit prediction scores | ~650K |
| kev.parquet | KEV known exploited vulns | ~9K |
| vulnrichment.parquet | CISA Vulnrichment (SSVC, CVSS, CWE) | ~200-400K |
| ghsa.parquet | GitHub Security Advisories | ~20-40K |
| sigma.parquet | Sigma detection rules | ~20-40K |
| exploitdb.parquet | ExploitDB public exploits | ~300-500K |
| combined.parquet | All sources merged (deduplicated) | ~5-10M |
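The "deduplicated" merges listed in the table can be pictured with a short, standard-library sketch. It assumes a triple is a plain `(subject, predicate, object)` tuple and that two triples are duplicates when all three fields match; the real converters may key on additional columns. The function name `merge_triples` and the placeholder IDs are hypothetical.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def merge_triples(*sources: Iterable[Triple]) -> List[Triple]:
    """Concatenate triples from several sources, keeping the first copy of each."""
    seen = set()
    merged = []
    for source in sources:
        for triple in source:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


# Placeholder IDs, not real ATT&CK identifiers.
enterprise = [("G0000", "uses", "T0000"), ("T0000", "belongs-to-tactic", "TA0000")]
mobile = [("G0000", "uses", "T0000"), ("G0000", "uses", "T0001")]

attack_all = merge_triples(enterprise, mobile)
print(len(attack_all))  # prints 3: the shared triple is kept once
```

Keeping insertion order makes the merged output stable across runs, which is convenient when comparing successive builds.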
ATT&CK <──> CAPEC <──> CWE <──> CVE <──> CPE
  ^                              ^
  ├── D3FEND (counters)          ├── EPSS (scores)
  ├── ATLAS (AI parallel)        ├── KEV (exploited)
  ├── CAR (detects)              ├── Vulnrichment (SSVC/CVSS)
  ├── ENGAGE (engages)           ├── GHSA (advisories)
  └── Sigma (detects)            ├── Sigma (related CVE)
                                 └── ExploitDB (exploits)
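One way to use these cross-links is to walk from a CVE to the ATT&CK techniques it may enable, hopping CVE → CWE → CAPEC → ATT&CK. The sketch below works on in-memory tuples and uses the predicate names from the schema diagram (`related-weakness`, `related-attack-pattern`, `maps-to-technique`). The helper names and all IDs are hypothetical placeholders, and the exact subject and object strings in the published files may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
Index = Dict[Tuple[str, str], Set[str]]


def build_index(triples: Iterable[Triple]) -> Index:
    """Map (subject, predicate) to the set of objects it points at."""
    index: Index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


def follow(index: Index, starts: Iterable[str], predicate: str) -> Set[str]:
    """Follow one predicate from every start node and union the results."""
    reached: Set[str] = set()
    for start in starts:
        reached |= index.get((start, predicate), set())
    return reached


def techniques_for_cve(index: Index, cve_id: str) -> List[str]:
    """Walk CVE -> CWE -> CAPEC -> ATT&CK technique."""
    weaknesses = follow(index, [cve_id], "related-weakness")
    patterns = follow(index, weaknesses, "related-attack-pattern")
    return sorted(follow(index, patterns, "maps-to-technique"))


# Placeholder triples for illustration only.
triples = [
    ("CVE-0000-0001", "related-weakness", "CWE-0"),
    ("CWE-0", "related-attack-pattern", "CAPEC-0"),
    ("CAPEC-0", "maps-to-technique", "T0000"),
    ("CAPEC-0", "maps-to-technique", "T0001"),
    ("CVE-0000-0002", "related-weakness", "CWE-1"),
]
index = build_index(triples)
print(techniques_for_cve(index, "CVE-0000-0001"))  # prints ['T0000', 'T0001']
```

A CVE whose weakness has no linked attack pattern, such as `CVE-0000-0002` here, simply yields an empty list.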
# Unit tests (no network access required)
python -m pytest tests/ -v --ignore=tests/test_integration.py
# Integration tests (downloads real ATT&CK data)
python -m pytest tests/test_integration.py -v
# All tests
python -m pytest tests/ -v

The dataset is published at s0u9ata/security-kg on HuggingFace Hub and auto-updated weekly via GitHub Actions.
See the dataset card for schema details, example queries, and usage with the datasets library.
The following sources were researched and evaluated for inclusion. They are deferred for now but may be added in future versions.
| Source | Format | Why Deferred |
|---|---|---|
| MISP Galaxies | JSON | Excellent structure with ATT&CK mappings; 100+ galaxy clusters covering threat actors, tools, sectors. Deferred to keep initial scope manageable. |
| EUVD | JSON | EU vulnerability database, structured, CVE-linked. New (launched 2025), API still maturing. |
| OSV | JSON | Google's open-source vulnerability DB with bulk download. Focused on software packages rather than CVE-level vulnerabilities. |
| Source | Country | Status |
|---|---|---|
| JVN iPedia | Japan | RSS feeds available, CVE-linked, bilingual (JP/EN). Limited bulk structured data access. |
| ThaiCERT | Thailand | 504 APT group threat cards, structured. Niche coverage, limited API. |
| CNNVD / CNVD | China | Access restrictions for non-Chinese IPs, data quality concerns, significant latency vs NVD. |
| KrCERT / KNVD | South Korea | Limited public API, Korean-language only. |
| BSI | Germany | Advisories available, German-language, no bulk structured feed. |
| ANSSI | France | Advisories and IOC reports, French-language, limited machine-readable data. |
| CERT-In | India | CVE CNA, publishes advisories but no bulk structured data download. |
| AusCERT | Australia | RSS feeds available, English-language. Limited structured data beyond advisories. |
| CERT-EU | EU | Threat landscape reports, limited machine-readable data. |
| BDU (FSTEC) | Russia | Poor data quality, slow updates, access restrictions. |
| Source | Why Not Included |
|---|---|
| MAEC | Malware attribute enumeration. Sparse community adoption, limited structured data available. |
| OVAL | Compliance-focused XML definitions. Very large, focused on system configuration rather than threat context. |
| CCE | Configuration enumeration (Excel format). Narrow scope, limited cross-linking potential. |
Apache 2.0