budermike/transformer_code_snippet_classifier
Transformer-based Programming Language Classification

This repository contains a Transformer-based classifier for identifying the programming language of short code snippets. The project is designed as an end-to-end, reproducible experiment that covers dataset preparation, tokenization, model design, training, and evaluation using modern deep learning techniques.

The implementation is provided as a Jupyter Notebook and is well-suited for experimentation, extension, and educational purposes, especially in the context of NLP-style models applied to source code.

Overview

Programming language classification from raw code snippets is a canonical task at the intersection of:

  • Natural Language Processing (NLP)
  • Machine Learning / Deep Learning
  • Software Engineering and Code Analysis

In this project, code snippets are treated as sequences of tokens and processed using a Transformer encoder architecture. The trained model predicts one of several programming languages based solely on the snippet content.
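As a minimal illustration of the "code as token sequences" framing, a snippet can be split into identifiers, numbers, and punctuation before being fed to the model. The regex below is only a sketch; the notebook's actual tokenizer may differ.

```python
import re

# Illustrative tokenizer: identifiers, integer literals, and any other
# single non-whitespace character each become one token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(snippet: str) -> list[str]:
    """Split a code snippet into a flat sequence of tokens."""
    return TOKEN_RE.findall(snippet)

tokens = tokenize('for i in range(10): print(i)')
# → ['for', 'i', 'in', 'range', '(', '10', ')', ':', 'print', '(', 'i', ')']
```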

Dataset

Description

This project uses a subset of the "GitHub Code Snippets" dataset originally published on Kaggle. The subset was created to make experimentation more lightweight while preserving class balance and diversity. You can find the link to the subset HERE.

Original Authors: Neeraj Kashyap, Andrey Dolgolev
Kaggle Owner: simiotic

Project Structure

.
├── transformer_code_snippet_classifier.ipynb
├── README.md
  • The entire pipeline is implemented in a single Jupyter Notebook for clarity.
  • The notebook is structured into clearly separated sections:
  1. Dataset Preparation
  2. Model Initialization
    2.1 Tokenizer
    2.2 Positional Embedding
    2.3 Transformer Block
    2.4 Model Building & Compilation
  3. Model Training and Evaluation
    3.1 Model Training
    3.2 Metrics

Model Architecture

The classifier is based on a Transformer encoder and follows the standard architecture used in sequence modeling:

  • Token embedding layer with masking
  • Positional encoding
  • Multi-head self-attention
  • Feed-forward MLP blocks
  • Residual connections and normalization
  • Final classification head

Training Pipeline

  1. Preprocessing

    • Tokenization of code snippets
    • Padding and truncation to a fixed sequence length
    • Encoding of language labels
  2. Training

    • Supervised learning with categorical cross-entropy
    • Train/validation split
    • Accuracy tracked per epoch
  3. Evaluation

    • Final evaluation on a held-out test set
    • Visualization of training and validation accuracy, plus a confusion matrix and ROC curve
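The preprocessing steps in stage 1 can be sketched with two small helpers: one that pads or truncates token-id sequences to a fixed length, and one that integer-encodes the language labels. `MAX_LEN`, `PAD_ID`, and the label set below are placeholder values, not the notebook's actual configuration.

```python
MAX_LEN = 8   # illustrative fixed sequence length
PAD_ID = 0    # illustrative padding token id

def pad_or_truncate(ids: list[int], max_len: int = MAX_LEN) -> list[int]:
    """Pad short sequences with PAD_ID and cut long ones to max_len."""
    return (ids + [PAD_ID] * max_len)[:max_len]

def encode_labels(labels: list[str]) -> tuple[list[int], dict[str, int]]:
    """Map each language name to a stable integer class id."""
    vocab = {lang: i for i, lang in enumerate(sorted(set(labels)))}
    return [vocab[l] for l in labels], vocab

padded = pad_or_truncate([5, 9, 2])
# → [5, 9, 2, 0, 0, 0, 0, 0]
ids, vocab = encode_labels(["python", "c", "python"])
# vocab == {'c': 0, 'python': 1}, ids == [1, 0, 1]
```

The integer label ids map directly onto the categorical cross-entropy objective used in stage 2.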

Possible Extensions

  • Pretraining on larger code corpora
  • Adding more programming languages
  • Comparing Transformers to RNNs
  • Applying transfer learning with pretrained code models
  • Implementing Transformer components from scratch

Acknowledgments

Special thanks to the creators of the original GitHub Code Snippets dataset for making their work publicly available. This subset was created to lower the barrier of entry for experimenting with machine learning models on source code.

License

  • Dataset: CC BY 4.0
  • Code: MIT License

About

End-to-end Transformer model for programming language classification on GitHub code snippets, including preprocessing, training, and evaluation.
