This repository contains a Transformer-based classifier for identifying the programming language of short code snippets. The project is designed as an end-to-end, reproducible experiment that covers dataset preparation, tokenization, model design, training, and evaluation using modern deep learning techniques.
The implementation is provided as a Jupyter Notebook and is well-suited for experimentation, extension, and educational purposes, especially in the context of NLP-style models applied to source code.
Programming language classification from raw code snippets is a canonical task at the intersection of:
- Natural Language Processing (NLP)
- Machine Learning / Deep Learning
- Software Engineering and Code Analysis
In this project, code snippets are treated as sequences of tokens and processed using a Transformer encoder architecture. The trained model predicts one of several programming languages based solely on the snippet content.
This project uses a subset of the "GitHub Code Snippets" dataset originally published on Kaggle. The subset was created to make experimentation more lightweight while preserving class balance and diversity. You can find the link to the subset HERE.
- Original dataset: "GitHub Code Snippets" (Kaggle)
- Dataset license: Attribution 4.0 International (CC BY 4.0)
- Original authors: Neeraj Kashyap, Andrey Dolgolev
- Kaggle owner: simiotic
.
├── transformer_code_snippet_classifier.ipynb
├── README.md
- The entire pipeline is implemented in a single Jupyter Notebook for clarity.
- The notebook is structured into clearly separated sections:
  1. Dataset Preparation
  2. Model Initialization
     - 2.1 Tokenizer
     - 2.2 Positional Embedding
     - 2.3 Transformer Block
     - 2.4 Model Building & Compilation
  3. Model Training and Evaluation
     - 3.1 Model Training
     - 3.2 Metrics
The classifier is based on a Transformer encoder and follows the standard architecture used in sequence modeling:
- Token embedding layer with masking
- Positional encoding
- Multi-head self-attention
- Feed-forward MLP blocks
- Residual connections and normalization
- Final classification head
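The components listed above can be sketched in plain NumPy as an illustrative forward pass. This is not the notebook's Keras implementation; the helper names, the single attention head, and all shapes (`d = 32`, sequence length 16) are hypothetical choices for the sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean / unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def positional_encoding(seq_len, d):
    # Sinusoidal positions, added to the token embeddings.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, params):
    # Attention sub-layer with residual connection + normalization,
    # followed by a feed-forward MLP sub-layer with the same pattern.
    x = layer_norm(x + self_attention(x, *params["attn"]))
    w1, w2 = params["ffn"]
    x = layer_norm(x + np.maximum(x @ w1, 0.0) @ w2)
    return x

rng = np.random.default_rng(0)
d, seq_len = 32, 16
params = {
    "attn": [0.1 * rng.normal(size=(d, d)) for _ in range(3)],
    "ffn": (0.1 * rng.normal(size=(d, 4 * d)), 0.1 * rng.normal(size=(4 * d, d))),
}
x = rng.normal(size=(seq_len, d)) + positional_encoding(seq_len, d)
out = transformer_block(x, params)  # shape (seq_len, d)
```

In the notebook these pieces correspond to Keras layers; a classification head would mean-pool `out` over the sequence and apply a dense softmax layer.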
Preprocessing
- Tokenization of code snippets
- Padding and truncation to a fixed sequence length
- Encoding of language labels
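The preprocessing steps above can be sketched as follows. The regex tokenizer, the vocabulary scheme, and `MAX_LEN = 8` are illustrative stand-ins, not the notebook's actual tokenizer:

```python
import re
import numpy as np

MAX_LEN = 8  # fixed sequence length (hypothetical value for the sketch)

def tokenize(snippet):
    # Crude tokenizer: identifiers/numbers, plus single punctuation symbols.
    return re.findall(r"\w+|[^\w\s]", snippet)

def build_vocab(snippets):
    # Index 0 is reserved for padding, 1 for unknown tokens.
    vocab = {"<pad>": 0, "<unk>": 1}
    for s in snippets:
        for tok in tokenize(s):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(snippet, vocab):
    ids = [vocab.get(t, 1) for t in tokenize(snippet)][:MAX_LEN]  # truncate
    return ids + [0] * (MAX_LEN - len(ids))                       # pad

snippets = ['print("hi")', "std::cout << x;"]
labels = ["python", "cpp"]

vocab = build_vocab(snippets)
label_to_id = {lab: i for i, lab in enumerate(sorted(set(labels)))}
X = np.array([encode(s, vocab) for s in snippets])   # (num_snippets, MAX_LEN)
y = np.array([label_to_id[lab] for lab in labels])   # integer class labels
```

Padding with a reserved id 0 is what makes the embedding layer's masking (mentioned under the architecture) possible.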
Training
- Supervised learning with categorical cross-entropy
- Train/validation split
- Accuracy tracked per epoch
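The training loop can be illustrated in miniature with a linear softmax classifier on toy data; the categorical cross-entropy, train/validation split, and per-epoch accuracy tracking mirror the notebook's setup, but the data, learning rate, and model here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D features standing in for pooled snippet representations,
# three classes with well-separated means.
means = np.array([[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]])
X = np.vstack([rng.normal(loc=m, size=(40, 2)) for m in means])
y = np.repeat(np.arange(3), 40)

# Train/validation split.
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:90], idx[90:]

W = np.zeros((2, 3))
b = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for epoch in range(100):
    probs = softmax(X[train_idx] @ W + b)
    onehot = np.eye(3)[y[train_idx]]
    # Categorical cross-entropy loss.
    loss = -np.mean(np.sum(onehot * np.log(probs + 1e-9), axis=1))
    # Full-batch gradient descent step.
    grad = (probs - onehot) / len(train_idx)
    W -= 0.5 * (X[train_idx].T @ grad)
    b -= 0.5 * grad.sum(axis=0)
    # Accuracy tracked per epoch on the validation split.
    val_acc = np.mean(softmax(X[val_idx] @ W + b).argmax(axis=1) == y[val_idx])
```

In the notebook the same objective and metrics are handled by the framework's compile/fit machinery rather than a hand-written loop.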
Evaluation
- Final evaluation on a held-out test set
- Visualization of training and validation accuracy, plus a confusion matrix and ROC curve
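The two evaluation artifacts can be computed by hand on hypothetical predictions (the notebook likely uses library routines for plotting; the labels and scores below are made up for the sketch):

```python
import numpy as np

# Hypothetical predictions on a held-out test set (3 classes).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Confusion matrix: rows = true class, columns = predicted class.
num_classes = 3
cm = np.zeros((num_classes, num_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1
accuracy = np.trace(cm) / cm.sum()

# One-vs-rest ROC AUC: probability that a random positive example
# receives a higher score than a random negative one (ties ignored).
def roc_auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

scores = np.array([0.1, 0.4, 0.35, 0.8])  # model scores for the positive class
binary = np.array([0, 0, 1, 1])
auc = roc_auc(scores, binary)  # 0.75: 3 of 4 positive/negative pairs ranked correctly
```

For a multi-class ROC curve, this one-vs-rest computation is repeated once per language.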
- Pretraining on larger code corpora
- Adding more programming languages
- Comparing Transformers to RNNs
- Applying transfer learning with pretrained code models
- Implementing Transformer concepts from scratch
Special thanks to the creators of the original GitHub Code Snippets dataset for making their work publicly available. This subset was created to lower the barrier of entry for experimenting with machine learning models on source code.
- Dataset: CC BY 4.0
- Code: MIT License