A deep learning approach for automated vulnerability detection in C/C++ and Java source code using Relational Graph Convolutional Networks (RGCN) applied to Code Property Graphs (CPGs).
This project implements a novel vulnerability detection system that:
- Extracts Code Property Graphs (CPGs) from source code using Joern
- Identifies vulnerable code patterns through statistical analysis
- Builds 1-hop subgraphs around vulnerability center nodes
- Learns semantic representations using Word2Vec on tokenized code
- Classifies vulnerable vs. non-vulnerable code using RGCN neural networks
On the vulnerable class, the system achieves a 97.32% F1-score on the C/C++ dataset and an 82.51% F1-score on the Java dataset.
Source Code → CPG Generation → Center Node Selection → Subgraph Extraction →
Tokenization → Word2Vec Embedding → RGCN Training → Vulnerability Classification
- CPG Generation: Uses Joern to extract structural representations
- Vulnerability Analysis: Statistical identification of vulnerable code patterns
- Subgraph Building: 1-hop neighborhood extraction around center nodes
- Feature Learning: Word2Vec embeddings for semantic code representation
- RGCN Classification: Graph neural network for vulnerability detection
- Nodes: Code elements (variables, function calls, control structures)
- Edges: Three types - AST (syntax), CFG (control flow), DDG (data dependencies)
- Features: Concatenation of vulnerability type embeddings and Word2Vec code vectors
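As an illustration of the feature layout described above, the sketch below builds one node's feature vector by concatenating a vulnerability-type embedding with a code vector. The helper names, the 2-dimensional toy vectors, the type labels, and the choice to average token vectors into a single code vector are all assumptions made for this example; they are not taken from the project scripts.

```python
from typing import Dict, List


def mean_vector(vectors: List[List[float]], dim: int) -> List[float]:
    """Average token vectors; fall back to zeros when no token is known."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_node_features(
    tokens: List[str],
    vuln_type: str,
    token_vectors: Dict[str, List[float]],
    type_vectors: Dict[str, List[float]],
    code_dim: int,
) -> List[float]:
    """Return [vulnerability-type embedding | averaged code embedding]."""
    known = [token_vectors[t] for t in tokens if t in token_vectors]
    code_vec = mean_vector(known, code_dim)
    return type_vectors[vuln_type] + code_vec


# Toy lookup tables standing in for trained embeddings (hypothetical values).
token_vectors = {"malloc": [1.0, 0.0], "(": [0.0, 1.0]}
type_vectors = {"memory_operation": [0.5, 0.5], "none": [0.0, 0.0]}

features = build_node_features(
    ["malloc", "(", "size"], "memory_operation", token_vectors, type_vectors, code_dim=2
)
print(features)  # [0.5, 0.5, 0.5, 0.5]
```

The unknown token `size` is skipped here; a real pipeline might instead map it to a dedicated out-of-vocabulary vector.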
- Input: Subgraph with node features (vulnerability type + code embeddings)
- RGCN Layers: Two relational convolution layers with 128 hidden dimensions
- Aggregation: Global mean pooling over subgraph nodes
- Output: Binary classification (vulnerable/non-vulnerable)
- Optimizer: Adam with learning rate 1e-3
- Regularization: Dropout (0.5) + Weight decay (5e-4)
- Early Stopping: Patience of 10 epochs on validation loss
- Data Split: 80% train, 10% validation, 10% test
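The 80/10/10 split and the early-stopping rule listed above can be sketched in plain Python. This is a minimal illustration under assumptions: the function and class names are hypothetical, the shuffle seed is arbitrary, and train.py may implement both differently.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle, then split into 80% train, 10% validation, 10% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


class EarlyStopping:
    """Signal a stop once validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0  # a real loop would save the best checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10

stopper = EarlyStopping(patience=10)
losses = [0.9, 0.7, 0.6] + [0.65] * 12  # made-up validation losses
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 13
```

The loss improves for three epochs and then stalls, so the tenth non-improving epoch (epoch 13) triggers the stop.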
- Operating System: Linux/WSL (required for Joern)
- Python: 3.8+ with uv package manager
- Java: JDK 17 or later
- Memory: 8GB+ RAM recommended for large datasets
# Clone repository
git clone <repository-url>
cd ML-Project
# Create Python environment using uv
uv venv
source .venv/bin/activate # Linux/Mac
# .venv\Scripts\activate # Windows
# Install dependencies
uv pip install torch torch-geometric gensim scikit-learn pandas numpy tqdm pydot networkx matplotlib requests
# Download Joern installation script
curl -L "https://github.com/joernio/joern/releases/latest/download/joern-install.sh" -o joern-install.sh
chmod +x joern-install.sh
# Install Joern
./joern-install.sh
# Add to PATH
echo 'export PATH="$PATH:$HOME/joern/joern-cli"' >> ~/.bashrc
source ~/.bashrc
# Verify installation
joern --version
# Install OpenJDK 17
sudo apt update
sudo apt install openjdk-17-jdk
# Set JAVA_HOME
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
# Verify installation
java -version
Follow these steps to run the full vulnerability detection pipeline:
Download vulnerable and non-vulnerable code samples from the SARD dataset:
python get_samples.py
This downloads samples into data_{language}/ directories with separate folders for source code, JSON metadata, and compressed files.
Generate Code Property Graphs using Joern:
# For C/C++ code
chmod +x process_c_cpgs.sh
./process_c_cpgs.sh
# For Java code
chmod +x process_java_cpgs.sh
./process_java_cpgs.sh
Output: CPG files in DOT format under data_{language}/cpg-output/
Create vulnerability characteristic mapping:
python create_vuln_char_table.py
Output: vuln-char-table-final.csv with vulnerability type mappings
Identify nodes representing vulnerability patterns:
python select_centernode.py
Output: center_nodes_result.json with selected vulnerability-relevant nodes
Build 1-hop subgraphs and tokenize code:
python subgraph_building_and_tokenizing.py
Output:
- subgraph_contexts/: Raw subgraph files
- tokenized_contexts/: Processed and tokenized code
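To make the subgraph and tokenization step concrete, here is a small sketch of 1-hop neighborhood extraction around a center node, followed by a regex tokenizer. Both are illustrative assumptions: the edge-tuple layout, the choice to treat edges as undirected when collecting neighbors, and the token pattern are not taken from subgraph_building_and_tokenizing.py.

```python
import re
from typing import List, Set, Tuple

Edge = Tuple[int, int, str]  # (source node id, target node id, edge type)


def one_hop_subgraph(center: int, edges: List[Edge]) -> Tuple[Set[int], List[Edge]]:
    """Return the center node plus direct neighbors, and every edge among them."""
    nodes = {center}
    for src, dst, _ in edges:
        # Assumption: follow edges in both directions when collecting neighbors.
        if src == center:
            nodes.add(dst)
        elif dst == center:
            nodes.add(src)
    kept = [(s, d, t) for s, d, t in edges if s in nodes and d in nodes]
    return nodes, kept


# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> List[str]:
    """Split a code snippet into coarse lexical tokens."""
    return TOKEN_RE.findall(code)


edges = [
    (1, 2, "AST"),
    (2, 3, "CFG"),
    (3, 4, "DDG"),
    (1, 3, "DDG"),
    (4, 5, "AST"),
]
nodes, kept = one_hop_subgraph(center=3, edges=edges)
print(sorted(nodes))  # [1, 2, 3, 4]
print(kept)  # [(1, 2, 'AST'), (2, 3, 'CFG'), (3, 4, 'DDG'), (1, 3, 'DDG')]
print(tokenize("buf[i] = malloc(10);"))
# ['buf', '[', 'i', ']', '=', 'malloc', '(', '10', ')', ';']
```

Node 5 is two hops from the center, so both it and the edge (4, 5) are dropped. The tokenizer splits multi-character operators such as `==` into single characters; a fuller lexer would keep them together.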
Train semantic embeddings on tokenized code:
python word2vec.py
Output: model/word2vec.model with 512-dimensional code embeddings
Convert subgraphs to PyTorch Geometric format:
python subgraph_embedding.py
Output: processed_subgraphs/all_subgraphs_pyg.pt ready for training
Train the RGCN vulnerability classifier:
python train.py
Output:
- best_rgcn.pt: Best model checkpoint
- Training logs with validation metrics
- Final test set evaluation
Key parameters can be modified in individual scripts:
- Language: Set LANGUAGE = "cpp" or "java" in scripts
- Sample Size: Modify limits in get_samples.py
- Model Architecture: Adjust hidden dimensions, layers in train.py
- Training: Change learning rate, epochs, regularization in train.py
| Dataset | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| C/C++ | 0 (Safe) | 84.10% | 90.38% | 87.13% |
| C/C++ | 1 (Vuln) | 98.05% | 96.59% | 97.32% |
| Java | 0 (Safe) | 62.85% | 87.65% | 73.21% |
| Java | 1 (Vuln) | 92.45% | 74.49% | 82.51% |
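- C/C++: Low false discovery rate (1.95%, i.e. 100% minus precision) and low false negative rate (3.41%, i.e. 100% minus recall) on the vulnerable class
- Java: Higher false discovery rate (7.55%) and false negative rate (25.51%) on the vulnerable class, possibly due to language complexity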
- Overall: Strong performance on vulnerability detection (Class 1)
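For readers who want to recompute the metrics in the table, the sketch below derives precision, recall, F1 and the related error rates from confusion-matrix counts. The counts are hypothetical and chosen only to keep the arithmetic easy; they are not the project's results. Note that 100% minus precision is the false discovery rate, while the false positive rate also needs the number of true negatives.

```python
from typing import Dict


def rates(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute per-class metrics for the positive (vulnerable) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_discovery_rate": fp / (tp + fp),  # equals 1 - precision
        "false_negative_rate": fn / (tp + fn),   # equals 1 - recall
        "false_positive_rate": fp / (fp + tn),   # share of safe samples flagged
    }


# Hypothetical counts for illustration only.
r = rates(tp=90, fp=10, fn=30, tn=70)
print(round(r["f1"], 4))  # 0.8182
print(r["false_discovery_rate"], r["false_negative_rate"], r["false_positive_rate"])
# 0.1 0.25 0.125
```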
Generated performance charts include:
- performance_comparison_overall.png: Complete metrics comparison
- precision_comparison.png: Precision across datasets
- recall_comparison.png: Recall analysis
- f1_score_comparison.png: F1-score comparison
- class0_error_rates.png & class1_error_rates.png: Error rate analysis
ML-Project/
├── README.md # This documentation
├── get_samples.py # SARD dataset collection
├── process_c_cpgs.sh # C/C++ CPG generation
├── process_java_cpgs.sh # Java CPG generation
├── create_vuln_char_table.py # Vulnerability mapping
├── select_centernode.py # Center node identification
├── subgraph_building_and_tokenizing.py # Subgraph extraction
├── word2vec.py # Semantic embedding training
├── subgraph_embedding.py # Graph data preparation
├── train.py # RGCN model training
├── plot.py # Performance visualization
├── feature_learning.py # Feature extraction utilities
├── extractToken.py # Code tokenization utilities
├── data_c/ # C/C++ dataset and outputs
├── data_cpp/ # C++ specific data
├── data_java/ # Java dataset and outputs
├── torch-rgcn/ # RGCN implementation library
└── docs/ # Additional documentation
The system identifies these vulnerability patterns:
- Function calls: Potentially unsafe API usage
- Memory operations: malloc/free, buffer operations
- Type operations: Casting, type checking
- Control structures: Conditional logic, loops
- Data access: Array indexing, field access
- Assignment operations: Variable modifications
- Architecture: Relational GCN with message passing
- Edge Types: AST (type=2), CFG (type=1), DDG (type=0)
- Aggregation: Neighbor feature averaging with edge-type weighting
- Activation: ReLU between layers
- Output: Softmax over the two classes (vulnerable / non-vulnerable)
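The relational message passing listed above can be written out in plain Python so the arithmetic is visible. This is a sketch under assumptions, not the code in train.py: it shows one layer in which each node adds its own transformed features to the per-edge-type mean of its transformed in-neighbors and then applies ReLU, followed by mean pooling and a softmax. The weights are hand-picked toy values, whereas the real model learns them and uses 128 hidden dimensions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, int]  # (source node, target node, edge type id)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def rgcn_layer(h: List[Vector], edges: List[Edge],
               w_self: Matrix, w_rel: Dict[int, Matrix]) -> List[Vector]:
    """Self term plus, per edge type, the mean of transformed in-neighbors; then ReLU."""
    out = [matvec(w_self, x) for x in h]
    for rel, w in w_rel.items():
        incoming: Dict[int, List[Vector]] = {}
        for src, dst, etype in edges:
            if etype == rel:
                incoming.setdefault(dst, []).append(matvec(w, h[src]))
        for dst, msgs in incoming.items():
            for k in range(len(out[dst])):
                out[dst][k] += sum(m[k] for m in msgs) / len(msgs)
    return [[max(0.0, v) for v in row] for row in out]


def global_mean_pool(h: List[Vector]) -> Vector:
    """Average node vectors into one graph-level vector."""
    return [sum(col) / len(h) for col in zip(*h)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


identity = [[1.0, 0.0], [0.0, 1.0]]
# Edge type ids follow this README: DDG=0, CFG=1, AST=2. Weights are toy values.
w_rel = {0: identity, 1: identity, 2: [[-2.0, 0.0], [0.0, -2.0]]}
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 2, 0), (1, 2, 0), (0, 1, 2)]  # two DDG edges into node 2, one AST edge

hidden = rgcn_layer(h, edges, identity, w_rel)
print(hidden)  # [[1.0, 0.0], [0.0, 1.0], [1.5, 1.5]]

pooled = global_mean_pool(hidden)
probs = softmax(matvec([[1.0, 0.0], [0.0, 2.0]], pooled))  # toy output layer
print(probs)
```

Node 2 receives the average of its two DDG neighbors, while the negative AST message into node 1 is clipped by ReLU. In PyTorch Geometric the same pattern is usually expressed with `RGCNConv` and `global_mean_pool`.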
- Edge Index Validation: Ensures graph connectivity integrity
- Feature Consistency: Validates node feature dimensions
- Label Distribution: Balanced sampling for training stability
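The three checks above might look like the following sketch. The function names are hypothetical, and balancing by undersampling the larger class is an assumption; the project may balance labels another way.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def validate_graph(num_nodes: int, edges: List[Tuple[int, int]],
                   features: List[List[float]], feature_dim: int) -> List[str]:
    """Return a list of human-readable problems; an empty list means the graph is valid."""
    problems = []
    for src, dst in edges:
        if not (0 <= src < num_nodes and 0 <= dst < num_nodes):
            problems.append(f"edge ({src}, {dst}) references a missing node")
    if len(features) != num_nodes:
        problems.append(f"expected {num_nodes} feature rows, got {len(features)}")
    for i, row in enumerate(features):
        if len(row) != feature_dim:
            problems.append(f"node {i} has {len(row)} features, expected {feature_dim}")
    return problems


def balanced_indices(labels: List[int], seed: int = 0) -> List[int]:
    """Undersample so every class keeps as many samples as the rarest class."""
    by_class: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n = min(len(ix) for ix in by_class.values())
    rng = random.Random(seed)
    chosen: List[int] = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], n))
    return sorted(chosen)


problems = validate_graph(
    num_nodes=3,
    edges=[(0, 1), (1, 3)],
    features=[[0.1, 0.2], [0.3, 0.4], [0.5]],
    feature_dim=2,
)
print(problems)
# ['edge (1, 3) references a missing node', 'node 2 has 1 features, expected 2']

labels = [1, 1, 1, 1, 0, 0]
kept = balanced_indices(labels)
print(Counter(labels[i] for i in kept))  # two samples of each class remain
```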
Joern Installation Problems:
# Ensure Java 17+ is installed and JAVA_HOME is set
java -version
echo $JAVA_HOME
# Check Joern binary permissions
ls -la $(which joern)
Memory Issues During Processing:
# Increase Java heap size
export JAVA_OPTS="-Xmx8g"
# Process smaller batches
# Modify batch sizes in processing scripts
PyTorch Geometric Installation:
# Install from the PyG wheel index that matches your PyTorch and CUDA build (CPU build of PyTorch 2.0.0 shown)
uv pip install torch-geometric -f https://data.pyg.org/whl/torch-2.0.0+cpu.html
Missing Dependencies:
# Install system dependencies
sudo apt install build-essential python3-dev
# Reinstall Python packages
uv pip install --force-reinstall torch torch-geometric
- Joern: Code analysis platform for CPG generation
- PyTorch Geometric: Graph neural network framework
- SARD Dataset: NIST Software Assurance Reference Dataset
- Gensim: Word2Vec implementation for semantic embeddings