A transformer-based text embedding model trained with contrastive learning (SimCSE approach) for generating high-quality sentence embeddings.
- Custom Transformer Architecture: Multi-head self-attention with position-wise feed-forward networks
- BPE Tokenizer: Trained from scratch on your data
- Contrastive Learning: SimCSE-style training for better embeddings
- CUDA Support: Automatic GPU acceleration when available
- Flexible Pooling: Mean, CLS token, or max pooling strategies (see the sketch after this list)
- Easy Inference: Simple API for embedding generation
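For illustration, here is a minimal sketch of the three pooling strategies. The function name and tensor names (`token_embeddings`, `attention_mask`) are assumptions for the example, not the repository's API; the real pooling lives in the model code under `src/models/`.

```python
# Sketch of mean / CLS / max pooling over per-token encoder outputs.
import torch

def pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor, strategy: str = "mean") -> torch.Tensor:
    """token_embeddings: (batch, seq_len, d_model); attention_mask: (batch, seq_len), 1 = real token."""
    mask = attention_mask.unsqueeze(-1).float()
    if strategy == "mean":
        # Average only over real (non-padding) tokens
        return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    if strategy == "cls":
        # Use the first token's vector as the sentence summary
        return token_embeddings[:, 0]
    if strategy == "max":
        # Mask out padding with -inf, then take the per-dimension maximum
        return token_embeddings.masked_fill(mask == 0, float("-inf")).max(dim=1).values
    raise ValueError(f"unknown pooling strategy: {strategy}")
```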
```
SemanticCore/
├── configs/
│   └── config.yaml                # Configuration file
├── data/
│   └── custom_tokenizer.json      # Trained tokenizer (generated)
├── outputs/
│   ├── best_model.pt              # Trained model (generated)
│   ├── training_history.png       # Training curves (generated)
│   └── embeddings_tsne.png        # t-SNE visualization (generated)
├── src/
│   ├── models/
│   │   ├── attention.py           # Multi-head attention
│   │   ├── encoder.py             # Transformer encoder block
│   │   ├── feed_forward.py        # Feed-forward network
│   │   ├── positional_encoding.py # Positional encoding
│   │   └── transformer_model.py   # Main model
│   ├── utils/
│   │   ├── config.py              # Config utilities
│   │   └── evaluation.py          # Visualization & evaluation
│   ├── dataset.py                 # Dataset class
│   ├── losses.py                  # Contrastive loss functions
│   ├── tokenizer.py               # Tokenizer training
│   └── train.py                   # Training loop
├── main.py                        # Main training script
├── pyproject.toml                 # Dependencies
└── README.md                      # This file
```
- Install dependencies:

  ```bash
  uv sync
  ```

  Or manually:

  ```bash
  pip install tokenizers datasets tqdm numpy matplotlib scikit-learn pyyaml
  ```

- Install CUDA-enabled torch:

  ```bash
  uv run pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
  ```

- Configure your settings in `configs/config.yaml`.
- Run training:

  ```bash
  python main.py
  ```

The script will:
- Load the AG News dataset (see the loading sketch after this list)
- Train a BPE tokenizer
- Create and train the transformer model
- Save checkpoints and visualizations
- Automatically use CUDA if available
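As a sketch of the first two steps (data loading and tokenizer training), something along these lines works; the exact code in `main.py` may differ, and the `text`/`label` column names come from the Hugging Face `ag_news` dataset:

```python
# Load AG News and train the BPE tokenizer on its raw texts (sketch).
from datasets import load_dataset

from src.tokenizer import TokenizerTrainer  # repo module

dataset = load_dataset("ag_news", split="train")
texts = dataset["text"]    # raw news texts
labels = dataset["label"]  # 4 topic categories

# Train the BPE tokenizer once, then reuse it for model training
tokenizer_trainer = TokenizerTrainer(vocab_size=10000)
tokenizer = tokenizer_trainer.train(texts, save_path="data/custom_tokenizer.json")
```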
Edit `configs/config.yaml` to customize:

```yaml
model:
  d_model: 256        # Embedding dimension
  num_heads: 8        # Attention heads
  num_layers: 4       # Transformer layers

training:
  batch_size: 64
  learning_rate: 0.0003
  num_epochs: 5

device:
  use_cuda: true      # Enable GPU training
```

To train programmatically instead of through `main.py`:

```python
from src.tokenizer import TokenizerTrainer
from src.models import TransformerEmbeddingModel
from src.train import EmbeddingTrainer
from src.dataset import TextDataset
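# NOTE: `texts` (a list of raw strings) and `train_dataset` (a TextDataset built from
# them) are assumed to be prepared beforehand; see src/dataset.py for the dataset class.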
# Train tokenizer
tokenizer_trainer = TokenizerTrainer(vocab_size=10000)
tokenizer = tokenizer_trainer.train(texts, save_path="tokenizer.json")
# Create model
model = TransformerEmbeddingModel(
    vocab_size=tokenizer.get_vocab_size(),
    d_model=256,
    num_heads=8,
    num_layers=4
)
# Train
trainer = EmbeddingTrainer(model, tokenizer, train_dataset)
trainer.train(num_epochs=5)
```

To generate embeddings with a trained model:

```python
from src.utils import SentenceEmbedder

# Load trained model
embedder = SentenceEmbedder(
    model_path='outputs/best_model.pt',
    tokenizer_path='data/custom_tokenizer.json'
)
# Generate embeddings
texts = ["This is a sentence.", "Another sentence."]
embeddings = embedder.encode(texts)
# Compute similarity
similarity = embedder.similarity("Hello world", "Hi there")
print(f"Similarity: {similarity:.4f}")Input Text
The full pipeline, from raw text to sentence embedding:

```
Input Text
     ↓
[Tokenizer]
     ↓
Token Embeddings (vocab_size → d_model)
     ↓
Positional Encoding
     ↓
[Transformer Encoder Blocks] × N
│    ├── Multi-Head Self-Attention
│    ├── Layer Normalization
│    ├── Feed-Forward Network
│    └── Layer Normalization
     ↓
Pooling (Mean/CLS/Max)
     ↓
Sentence Embedding (d_model)
```
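To make the diagram concrete, here is a compact sketch of the same pipeline wired up with stock `torch.nn` modules. The repository builds these blocks from scratch in `src/models/`, so treat this as an illustration rather than the actual implementation:

```python
# Sketch: token embeddings + sinusoidal positional encoding + encoder stack + mean pooling.
import math
import torch
import torch.nn as nn

class TinyEmbeddingModel(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, num_heads: int = 8,
                 num_layers: int = 4, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Precompute sinusoidal positional encodings
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        x = self.token_emb(input_ids) + self.pe[: input_ids.size(1)]
        # True entries in src_key_padding_mask mark padding positions to ignore
        hidden = self.encoder(x, src_key_padding_mask=(attention_mask == 0))
        mask = attention_mask.unsqueeze(-1).float()
        # Mean pooling over real tokens -> one (d_model) vector per sentence
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Usage (shapes only): emb = TinyEmbeddingModel(10000)(input_ids, attention_mask)  # (batch, d_model)
```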
Training produces two plots in `outputs/`:
- `training_history.png` shows the training and validation loss over epochs, along with the learning rate schedule (linear warmup followed by decay).
- `embeddings_tsne.png` is a t-SNE visualization of the learned embeddings colored by category; similar texts cluster together, demonstrating the model's ability to capture semantic relationships (see the sketch after this list).
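The t-SNE figure can be reproduced roughly as follows (a sketch; the actual plotting code lives in `src/utils/evaluation.py` and may differ), assuming `embeddings` is an `(n, d_model)` array and `labels` holds the category ids:

```python
# Sketch of the t-SNE visualization of sentence embeddings.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray,
              out_path: str = "outputs/embeddings_tsne.png") -> None:
    # Project high-dimensional embeddings down to 2D
    points = TSNE(n_components=2, init="pca", random_state=42).fit_transform(embeddings)
    fig, ax = plt.subplots(figsize=(8, 6))
    scatter = ax.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
    ax.legend(*scatter.legend_elements(), title="Category")
    ax.set_title("t-SNE of sentence embeddings")
    fig.savefig(out_path, dpi=150)
```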
Training setup:
- Loss: Contrastive loss (SimCSE-style; see the sketch after this list)
- Optimizer: AdamW with weight decay
- Learning Rate: Linear warmup + decay
- Gradient Clipping: Max norm 1.0
- Augmentation: Dropout-based (same text, different dropout masks)
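A minimal sketch of that objective: each batch is encoded twice so that dropout produces two slightly different views of every text, and an InfoNCE-style cross-entropy over in-batch cosine similarities pulls the two views together while pushing other texts apart. The temperature value and function name below are assumptions; the repository's version lives in `src/losses.py`:

```python
# SimCSE-style in-batch contrastive loss (sketch). z1, z2 are two encodings of the
# same batch of texts obtained with different dropout masks; shape (batch, d_model).
import torch
import torch.nn.functional as F

def simcse_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # Cosine similarity of every z1[i] against every z2[j]; diagonal entries are the positives
    sim = z1 @ z2.T / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)

# Usage inside a training step (dropout must be active, i.e. model.train()):
#   z1 = model(input_ids, attention_mask)
#   z2 = model(input_ids, attention_mask)   # same inputs, different dropout mask
#   loss = simcse_loss(z1, z2)
```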
References:
- Attention Is All You Need (Vaswani et al., 2017) - the original Transformer
- SimCSE (Gao et al., 2021) - contrastive learning of sentence embeddings
- Sentence-BERT (Reimers & Gurevych, 2019) - sentence embeddings using Siamese networks

