This repository contains a Transformer-based classifier for identifying the programming language of short code snippets. The project is designed as an end-to-end, reproducible experiment that covers dataset preparation, tokenization, model design, training, and evaluation using modern deep learning techniques.
The implementation is provided as a Jupyter Notebook and is well-suited for experimentation, extension, and educational purposes, especially in the context of NLP-style models applied to source code.
Programming language classification from raw code snippets is a canonical task at the intersection of:
- Natural Language Processing (NLP)
- Machine Learning / Deep Learning
- Software Engineering and Code Analysis
In this project, code snippets are treated as sequences of tokens and processed using a Transformer encoder architecture. The trained model predicts one of several programming languages based solely on the snippet content.
This project uses a subset of the "GitHub Code Snippets" dataset originally published on Kaggle. The subset was created to make experimentation more lightweight while preserving class balance and diversity. You can find the link to the subset HERE.
- Original dataset: "GitHub Code Snippets" (Kaggle)
- Dataset license: Attribution 4.0 International (CC BY 4.0)
- Original authors: Neeraj Kashyap, Andrey Dolgolev
- Kaggle owner: simiotic
.
├── transformer_code_snippet_classifier.ipynb
├── README.md
- The entire pipeline is implemented in a single Jupyter Notebook for clarity.
- The notebook is structured into clearly separated sections:
  1. Dataset Preparation
  2. Model Initialization
     - 2.1 Tokenizer
     - 2.2 Positional Embedding
     - 2.3 Transformer Block
     - 2.4 Model Building & Compilation
  3. Model Training and Evaluation
     - 3.1 Model Training
     - 3.2 Metrics
The classifier is based on a Transformer encoder and follows the standard architecture used in sequence modeling:
- Token embedding layer with masking
- Positional encoding
- Multi-head self-attention
- Feed-forward MLP blocks
- Residual connections and normalization
- Final classification head
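The components listed above can be sketched in plain NumPy as an illustrative forward pass. This is not the notebook's Keras implementation; the helper names, the single attention head, and all shapes (`d = 32`, sequence length 16) are hypothetical choices for the sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean / unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(seq_len, d):
    # Sinusoidal positions, added to the token embeddings.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, params):
    # Attention sub-layer with residual connection + normalization,
    # followed by a feed-forward MLP sub-layer with the same pattern.
    x = layer_norm(x + self_attention(x, *params["attn"]))
    w1, w2 = params["ffn"]
    x = layer_norm(x + np.maximum(x @ w1, 0.0) @ w2)
    return x

rng = np.random.default_rng(0)
d, seq_len = 32, 16
params = {
    "attn": [0.1 * rng.normal(size=(d, d)) for _ in range(3)],
    "ffn": (0.1 * rng.normal(size=(d, 4 * d)), 0.1 * rng.normal(size=(4 * d, d))),
}
x = rng.normal(size=(seq_len, d)) + positional_encoding(seq_len, d)
out = transformer_block(x, params)  # shape (seq_len, d)
```

In the notebook these pieces correspond to Keras layers; a classification head would mean-pool `out` over the sequence and apply a dense softmax layer.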
Preprocessing
- Tokenization of code snippets
- Padding and truncation to a fixed sequence length
- Encoding of language labels
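The preprocessing steps above can be sketched as follows. The regex tokenizer, the vocabulary scheme, and `MAX_LEN = 8` are illustrative stand-ins, not the notebook's actual tokenizer:

```python
import re
import numpy as np

MAX_LEN = 8  # fixed sequence length (hypothetical value for the sketch)

def tokenize(snippet):
    # Crude tokenizer: identifiers/numbers, plus single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", snippet)

def build_vocab(snippets):
    # Index 0 is reserved for padding, 1 for unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for s in snippets:
        for tok in tokenize(s):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(snippet, vocab):
    ids = [vocab.get(t, 1) for t in tokenize(snippet)][:MAX_LEN]  # truncate
    return ids + [0] * (MAX_LEN - len(ids))                       # pad

snippets = ['print("hi")', "std::cout << x;"]
labels = ["python", "cpp"]

vocab = build_vocab(snippets)
label_to_id = {lab: i for i, lab in enumerate(sorted(set(labels)))}
X = np.array([encode(s, vocab) for s in snippets])   # (num_snippets, MAX_LEN)
y = np.array([label_to_id[lab] for lab in labels])   # integer class labels
```

Padding with a reserved id 0 is what makes the embedding layer's masking (mentioned under the architecture) possible.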
Training
- Supervised learning with categorical cross-entropy
- Train/validation split
- Accuracy tracked per epoch
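The training loop can be illustrated in miniature with a linear softmax classifier on toy data; the categorical cross-entropy, train/validation split, and per-epoch accuracy tracking mirror the notebook's setup, but the data, learning rate, and model here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D features standing in for pooled snippet representations,
# three classes with well-separated means.
means = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in means])
y = np.repeat(np.arange(3), 40)

# Train/validation split.
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:90], idx[90:]

W = np.zeros((2, 3))
b = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for epoch in range(100):
    probs = softmax(X[train_idx] @ W + b)
    onehot = np.eye(3)[y[train_idx]]
    # Categorical cross-entropy loss.
    loss = -np.mean(np.sum(onehot * np.log(probs + 1e-9), axis=1))
    # Full-batch gradient descent step.
    grad = (probs - onehot) / len(train_idx)
    W -= 0.5 * (X[train_idx].T @ grad)
    b -= 0.5 * grad.sum(axis=0)
    # Accuracy tracked per epoch on the validation split.
    val_acc = np.mean(softmax(X[val_idx] @ W + b).argmax(axis=1) == y[val_idx])
```

In the notebook the same objective and metrics are handled by the framework's compile/fit machinery rather than a hand-written loop.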
Evaluation
- Final evaluation on a held-out test set
- Visualization of training and validation accuracy, plus a confusion matrix and ROC curve
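The two evaluation artifacts can be computed by hand on hypothetical predictions (the notebook likely uses library routines for plotting; the labels and scores below are made up for the sketch):

```python
import numpy as np

# Hypothetical predictions on a held-out test set (3 classes).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Confusion matrix: rows = true class, columns = predicted class.
num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
accuracy = np.trace(cm) / cm.sum()

# One-vs-rest ROC AUC: probability that a random positive example
# receives a higher score than a random negative one (ties ignored).
def roc_auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

scores = np.array([0.1, 0.4, 0.35, 0.8])  # model scores for the positive class
binary = np.array([0, 0, 1, 1])
auc = roc_auc(scores, binary)  # 0.75: 3 of 4 positive/negative pairs ranked correctly
```

For a multi-class ROC curve, this one-vs-rest computation is repeated once per language.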
- Pretraining on larger code corpora
- Adding more programming languages
- Comparing Transformers to RNNs
- Applying transfer learning with pretrained code models
- Implementing Transformer concepts from scratch
Special thanks to the creators of the original GitHub Code Snippets dataset for making their work publicly available. This subset was created to lower the barrier of entry for experimenting with machine learning models on source code.
- Dataset: CC BY 4.0
- Code: MIT License