This repository provides an experimental framework for training and evaluating machine learning models on tabular data augmented with LLM-based embeddings. The project systematically studies how different types of embeddings—text embeddings, Random Tree Embeddings (RTE), and their combinations with structured features—affect downstream performance on tabular classification tasks.
The focus is on modular experimentation: datasets, feature types, embedding models, concatenation strategies, and downstream classifiers can be flexibly combined through configuration files.
The current experimental setup includes:

- Downstream models:
  - Logistic Regression (LR)
  - HistGradientBoostingClassifier (HGBDT)
- Embedding sources:
  - Text embeddings generated by 16 large language models
  - Random Tree Embeddings (RTE)
- Feature fusion / concatenation strategies:
  - RTE embeddings + full tabular features
  - Text embeddings + full tabular features
  - Text embeddings + numerical (metrical) features
  - Nominal text embeddings + numerical tabular features

These components can be selectively enabled or disabled via the configuration file.
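Each fusion strategy above boils down to column-wise concatenation of an embedding matrix with the chosen tabular features. The following is a minimal sketch using scikit-learn's `RandomTreesEmbedding` on a toy feature matrix; the data, dimensions, and classifier settings are illustrative, not the repository's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_tab = rng.normal(size=(200, 8))      # toy tabular feature matrix
y = (X_tab[:, 0] > 0).astype(int)      # toy binary target

# Random Tree Embeddings: sparse leaf-indicator features from random trees
rte = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_rte = rte.fit_transform(X_tab).toarray()

# Fusion strategy "RTE embeddings + full tabular features":
# concatenate the embedding matrix with the original features column-wise
X_fused = np.hstack([X_rte, X_tab])

# Train a downstream classifier on the fused representation
clf = LogisticRegression(max_iter=1000).fit(X_fused, y)
```

Text-embedding fusion works the same way: replace `X_rte` with the embedding matrix produced by an LLM for the per-row text summaries.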
- Clone the repository:

  ```bash
  git clone https://github.com/ml-lab-htw/tab-embeddings.git
  cd tab-embeddings
  ```

- (Optional but recommended) Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # macOS / Linux
  venv\Scripts\activate      # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The framework supports both provided example datasets and custom user-defined datasets. To use a new dataset, it must be registered and preprocessed as described below.
Create a new directory under `data/` named after your dataset:

```
data/
└── <dataset>/
    ├── X_<dataset>.csv
    └── y_<dataset>.csv
```

- `X_<dataset>.csv`: feature matrix
- `y_<dataset>.csv`: target labels
Update `config/config.yaml` by adding entries under both `DATASETS` and `FEATURES`.
Example:

```yaml
DATASETS:
  bank_churn:
    path: ./data/<dataset>
    X: "X_<dataset>.csv"
    y: "y_<dataset>.csv"
    X_metr: "X_<dataset>_metrics.csv"
    X_nom: "X_<dataset>_nom.csv"
    summaries: "<dataset>_summaries.txt"
    nom_summaries: "<dataset>_nom_summaries.txt"
    pca_components: 50
    n_splits: 5
    n_repeats: 1

FEATURES:
  bank_churn:
    nominal_features: [
      'nom_feat_1',
      'nom_feat_2'
    ]
    text_features: ["text"]
```

For each dataset, the following derived files must be created:

- `X_<dataset>_metrics.csv` – numerical (metrical) features only
- `X_<dataset>_nom.csv` – nominal (categorical) features only
- `<dataset>_summaries.txt` – text summaries generated from the full dataset
- `<dataset>_nom_summaries.txt` – summaries generated from nominal features only
These files are required for the embedding and fusion experiments.
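Conceptually, the metrical/nominal split separates a feature matrix into two files by column type. A minimal sketch of that step with pandas follows; the function name and toy data are illustrative (the actual `split` command is implemented in `src/main.py`):

```python
import pandas as pd

# Illustrative version of the metrical/nominal split; the repository's
# actual implementation may differ in detail.
def split_features(X, nominal_features):
    """Separate a feature matrix into numerical (metrical) and nominal parts."""
    X_nom = X[nominal_features]                # nominal columns only
    X_metr = X.drop(columns=nominal_features)  # all remaining (numerical) columns
    return X_metr, X_nom

# Toy dataset with two numerical columns and one nominal column
X = pd.DataFrame({
    "age": [34, 51, 29],
    "balance": [1200.5, 300.0, 845.2],
    "country": ["DE", "FR", "DE"],
})
X_metr, X_nom = split_features(X, nominal_features=["country"])

# Write the derived files expected by the pipeline
X_metr.to_csv("X_toy_metrics.csv", index=False)
X_nom.to_csv("X_toy_nom.csv", index=False)
```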
From the project root:

```bash
cd tab-embeddings
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
```

- Split numerical and nominal features:

  ```bash
  python -m src.main --config config/config.yaml split --dataset <dataset>
  ```

- Generate summaries from the full dataset:

  ```bash
  python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope full
  ```

- Generate summaries from nominal features only:

  ```bash
  python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope nominal
  ```
After these steps, the dataset is fully prepared for the experiment pipeline.
Experiments are controlled via a YAML configuration file. You may either:

- Modify `config/config.yaml`, or
- Create a custom configuration file by strictly following the structure of `config/config.py`.
You can selectively disable the following by commenting them out in the configuration file:

- specific datasets
- embedding models
- downstream classifiers
- experiment types
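For illustration, a commented-out block is simply skipped without deleting its configuration; the second dataset entry below is hypothetical:

```yaml
DATASETS:
  bank_churn:
    path: ./data/bank_churn
    X: "X_bank_churn.csv"
    y: "y_bank_churn.csv"
  # toy_dataset:                 # commented out: this dataset is skipped
  #   path: ./data/toy_dataset
```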
For quick sanity checks, enable test mode in the configuration:

```yaml
TEST_MODE: True
TEST_SAMPLES: 200
```

In this mode:

- Only `TEST_LLM_KEYS` are used
- Only `TEST_EXPERIMENTS` are executed
- The number of samples is reduced
```bash
python src/main.py --config config/config.yaml
```

If you encounter a module resolution error, use:

```bash
python -m src.main --config config/config.yaml
```

- Additional LLMs can be integrated for embedding generation
- New downstream models can be added
- New fusion strategies can be implemented
Detailed extension guidelines will be added in future versions.
This project is licensed under the MIT License. See the LICENSE file for details.
Oksana Kolomenko, Ricardo Knauer, Erik Rodner