SemanticCore

A transformer-based text embedding model trained with contrastive learning (SimCSE approach) for generating high-quality sentence embeddings.

🎯 Features

  • Custom Transformer Architecture: Multi-head self-attention with position-wise feed-forward networks
  • BPE Tokenizer: Trained from scratch on your data
  • Contrastive Learning: SimCSE-style training for better embeddings
  • CUDA Support: Automatic GPU acceleration when available
  • Flexible Pooling: Mean, CLS token, or max pooling strategies
  • Easy Inference: Simple API for embedding generation

Project Structure

SemanticCore/
├── configs/
│   └── config.yaml              # Configuration file
├── data/
│   └── custom_tokenizer.json    # Trained tokenizer (generated)
├── outputs/
│   ├── best_model.pt            # Trained model (generated)
│   ├── training_history.png     # Training curves (generated)
│   └── embeddings_tsne.png      # t-SNE visualization (generated)
├── src/
│   ├── models/
│   │   ├── attention.py         # Multi-head attention
│   │   ├── encoder.py           # Transformer encoder block
│   │   ├── feed_forward.py      # Feed-forward network
│   │   ├── positional_encoding.py  # Positional encoding
│   │   └── transformer_model.py # Main model
│   ├── utils/
│   │   ├── config.py            # Config utilities
│   │   └── evaluation.py        # Visualization & evaluation
│   ├── dataset.py               # Dataset class
│   ├── losses.py                # Contrastive loss functions
│   ├── tokenizer.py             # Tokenizer training
│   └── train.py                 # Training loop
├── main.py                      # Main training script
├── pyproject.toml               # Dependencies
└── README.md                    # This file

🚀 Quick Start

Installation

  1. Install dependencies:
uv sync

Or manually:

pip install tokenizers datasets tqdm numpy matplotlib scikit-learn pyyaml
  2. Install CUDA-enabled PyTorch (optional, for GPU acceleration):
uv run pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126

Training

  1. Configure your settings in configs/config.yaml
  2. Run training:
python main.py

The script will (sketched briefly after this list):

  • Load the AG News dataset
  • Train a BPE tokenizer
  • Create and train the transformer model
  • Save checkpoints and visualizations
  • Automatically use CUDA if available
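For orientation, here is a minimal sketch of the first and last of those steps, assuming main.py reads the data with the Hugging Face datasets library and picks the device with a standard torch check; the exact code in this repo may differ:

import torch
from datasets import load_dataset

# Load the AG News dataset (a "text" column plus a 4-class "label" column)
dataset = load_dataset("ag_news")
train_texts = dataset["train"]["text"]

# Fall back to CPU when CUDA is unavailable or disabled in the config
use_cuda = True  # mirrors device.use_cuda in configs/config.yaml
device = torch.device("cuda" if use_cuda and torch.cuda.is_available() else "cpu")
print(f"Training on {device}")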

⚙️ Configuration

Edit configs/config.yaml to customize the model, training, and device settings (loading it is sketched after the example below):

model:
  d_model: 256              # Embedding dimension
  num_heads: 8              # Attention heads
  num_layers: 4             # Transformer layers
  
training:
  batch_size: 64
  learning_rate: 0.0003
  num_epochs: 5
  
device:
  use_cuda: true            # Enable GPU training
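
The config is consumed by src/utils/config.py; its exact helper names are not shown in this README, but reading the file amounts to a plain PyYAML load, sketched here:

import yaml

# Read configs/config.yaml into a nested dict
with open("configs/config.yaml") as f:
    config = yaml.safe_load(f)

d_model = config["model"]["d_model"]            # 256
batch_size = config["training"]["batch_size"]   # 64
use_cuda = config["device"]["use_cuda"]         # true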

💻 Usage

Training the Model

from src.tokenizer import TokenizerTrainer
from src.models import TransformerEmbeddingModel
from src.train import EmbeddingTrainer
from src.dataset import TextDataset

# Train tokenizer
tokenizer_trainer = TokenizerTrainer(vocab_size=10000)
tokenizer = tokenizer_trainer.train(texts, save_path="tokenizer.json")

# Create model
model = TransformerEmbeddingModel(
    vocab_size=tokenizer.get_vocab_size(),
    d_model=256,
    num_heads=8,
    num_layers=4
)

# Train (train_dataset is a TextDataset built from your texts; see src/dataset.py for its arguments)
trainer = EmbeddingTrainer(model, tokenizer, train_dataset)
trainer.train(num_epochs=5)
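
The contrastive loss itself lives in src/losses.py. Its exact implementation is not reproduced in this README, but the SimCSE idea can be sketched as an in-batch InfoNCE loss over two dropout-noised encodings of the same texts:

import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temperature=0.05):
    """z1, z2: (batch, d_model) embeddings of the same texts from two
    forward passes, so they differ only by their dropout masks."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temperature                         # pairwise cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives sit on the diagonal
    return F.cross_entropy(sim, labels)

# Two passes over the same batch -> two different dropout masks:
# loss = simcse_loss(model(input_ids), model(input_ids))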

Using Trained Model

from src.utils import SentenceEmbedder

# Load trained model
embedder = SentenceEmbedder(
    model_path='outputs/best_model.pt',
    tokenizer_path='data/custom_tokenizer.json'
)

# Generate embeddings
texts = ["This is a sentence.", "Another sentence."]
embeddings = embedder.encode(texts)

# Compute similarity
similarity = embedder.similarity("Hello world", "Hi there")
print(f"Similarity: {similarity:.4f}")

🔬 Model Architecture

Input Text
    ↓
[Tokenizer]
    ↓
Token Embeddings (vocab_size → d_model)
    ↓
Positional Encoding
    ↓
[Transformer Encoder Blocks] × N
│   ├── Multi-Head Self-Attention
│   ├── Layer Normalization
│   ├── Feed-Forward Network
│   └── Layer Normalization
    ↓
Pooling (Mean/CLS/Max)
    ↓
Sentence Embedding (d_model)
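
The pooling step at the bottom of the diagram collapses per-token hidden states into one sentence vector. A sketch of masked mean pooling, one of the three strategies listed above (the repo's own pooling code may differ in detail):

import torch

def mean_pool(token_states, attention_mask):
    """token_states: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 for real tokens."""
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (token_states * mask).sum(dim=1)     # zero out padding positions
    counts = mask.sum(dim=1).clamp(min=1e-9)      # number of real tokens per sentence
    return summed / counts                        # (batch, d_model)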

📊 Training Results

Training History

(Training curves: outputs/training_history.png)

The plot shows the training and validation loss over epochs, along with the learning-rate schedule (linear warmup followed by decay).

t-SNE Visualization

(t-SNE plot: outputs/embeddings_tsne.png)

t-SNE visualization of learned embeddings colored by category. Similar texts cluster together, demonstrating the model's ability to capture semantic relationships.

Training Details

  • Loss: Contrastive Loss (SimCSE)
  • Optimizer: AdamW with weight decay
  • Learning Rate: Linear warmup + decay
  • Gradient Clipping: Max norm 1.0
  • Augmentation: Dropout-based (two forward passes over the same text, each with a different dropout mask); see the sketch below for how these pieces fit together
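
A sketch of how these pieces typically fit together in PyTorch. The learning rate and clipping norm mirror the values above, while the step counts and weight decay are illustrative; the repo's train.py may differ in detail:

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)  # weight_decay is illustrative

warmup_steps, total_steps = 500, 5000  # illustrative step counts

def lr_lambda(step):
    # Linear warmup to the base LR, then linear decay toward zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)

for input_ids in train_loader:
    loss = simcse_loss(model(input_ids), model(input_ids))  # two dropout-noised passes
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    scheduler.step()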

🎓 References

  • Gao, T., Yao, X., and Chen, D. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." EMNLP 2021.
  • Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017.
  • Zhang, X., Zhao, J., and LeCun, Y. "Character-level Convolutional Networks for Text Classification." NeurIPS 2015 (source of the AG News dataset).
