This repository provides an experimental framework for training and evaluating machine learning models on tabular data augmented with LLM-based embeddings. The project systematically studies how different types of embeddings—text embeddings, Random Tree Embeddings (RTE), and their combinations with structured features—affect downstream performance on tabular classification tasks.
The focus is on modular experimentation: datasets, feature types, embedding models, concatenation strategies, and downstream classifiers can be flexibly combined through configuration files.
The current experimental setup includes:

- Downstream models:
  - Logistic Regression (LR)
  - HistGradientBoostingClassifier (HGBDT)
- Embedding sources:
  - Text embeddings generated by 16 large language models
  - Random Tree Embeddings (RTE)
- Feature fusion / concatenation strategies:
  - RTE embeddings + full tabular features
  - Text embeddings + full tabular features
  - Text embeddings + numerical (metrical) features
  - Nominal text embeddings + numerical tabular features

These components can be selectively enabled or disabled via the configuration file.
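Each fusion strategy above boils down to column-wise concatenation of an embedding matrix with the chosen tabular features. The following is a minimal sketch using scikit-learn's `RandomTreesEmbedding` on a toy feature matrix; the data, dimensions, and classifier settings are illustrative, not the repository's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_tab = rng.normal(size=(200, 8))      # toy tabular feature matrix
y = (X_tab[:, 0] > 0).astype(int)      # toy binary target

# Random Tree Embeddings: sparse leaf-indicator features from random trees
rte = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_rte = rte.fit_transform(X_tab).toarray()

# Fusion strategy "RTE embeddings + full tabular features":
# concatenate the embedding matrix with the original features column-wise
X_fused = np.hstack([X_rte, X_tab])

# Train a downstream classifier on the fused representation
clf = LogisticRegression(max_iter=1000).fit(X_fused, y)
```

Text-embedding fusion works the same way: replace `X_rte` with the embedding matrix produced by an LLM for the per-row text summaries.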
- Clone the repository:

  ```bash
  git clone https://github.com/ml-lab-htw/tab-embeddings.git
  cd tab-embeddings
  ```

- (Optional but recommended) Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # macOS / Linux
  venv\Scripts\activate      # Windows
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
The framework supports both provided example datasets and custom user-defined datasets. To use a new dataset, it must be registered and preprocessed as described below.
Create a new directory under `data/` named after your dataset:

```
data/
└── <dataset>/
    ├── X_<dataset>.csv
    └── y_<dataset>.csv
```

- `X_<dataset>.csv`: feature matrix
- `y_<dataset>.csv`: target labels
Update `config/config.yaml` by adding entries under both `DATASETS` and `FEATURES`.
Example:

```yaml
DATASETS:
  bank_churn:
    path: ./data/<dataset>
    X: "X_<dataset>.csv"
    y: "y_<dataset>.csv"
    X_metr: "X_<dataset>_metrics.csv"
    X_nom: "X_<dataset>_nom.csv"
    summaries: "<dataset>_summaries.txt"
    nom_summaries: "<dataset>_nom_summaries.txt"
    pca_components: 50
    n_splits: 5
    n_repeats: 1

FEATURES:
  bank_churn:
    nominal_features: [
      'nom_feat_1',
      'nom_feat_2'
    ]
    text_features: ["text"]
```

For each dataset, the following derived files must be created:

- `X_<dataset>_metrics.csv` – numerical (metrical) features only
- `X_<dataset>_nom.csv` – nominal (categorical) features only
- `<dataset>_summaries.txt` – text summaries generated from the full dataset
- `<dataset>_nom_summaries.txt` – summaries generated from nominal features only
These files are required for the embedding and fusion experiments.
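Conceptually, the metrical/nominal split separates a feature matrix into two files by column type. A minimal sketch of that step with pandas follows; the function name and toy data are illustrative (the actual `split` command is implemented in `src/main.py`):

```python
import pandas as pd

# Illustrative version of the metrical/nominal split; the repository's
# actual implementation may differ in detail.
def split_features(X, nominal_features):
    """Separate a feature matrix into numerical (metrical) and nominal parts."""
    X_nom = X[nominal_features]                # nominal columns only
    X_metr = X.drop(columns=nominal_features)  # all remaining (numerical) columns
    return X_metr, X_nom

# Toy dataset with two numerical columns and one nominal column
X = pd.DataFrame({
    "age": [34, 51, 29],
    "balance": [1200.5, 300.0, 845.2],
    "country": ["DE", "FR", "DE"],
})
X_metr, X_nom = split_features(X, nominal_features=["country"])

# Write the derived files expected by the pipeline
X_metr.to_csv("X_toy_metrics.csv", index=False)
X_nom.to_csv("X_toy_nom.csv", index=False)
```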
From the project root:

```bash
cd tab-embeddings
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
```

- Split numerical and nominal features:

  ```bash
  python -m src.main --config config/config.yaml split --dataset <dataset>
  ```

- Generate summaries from the full dataset:

  ```bash
  python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope full
  ```

- Generate summaries from nominal features only:

  ```bash
  python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope nominal
  ```
After these steps, the dataset is fully prepared for the experiment pipeline.
Experiments are controlled via a YAML configuration file. You may either:

- Modify `config/config.yaml`, or
- Create a custom configuration file by strictly following the structure of `config/config.py`.
You can selectively disable the following by commenting them out in the configuration file:

- specific datasets
- embedding models
- downstream classifiers
- experiment types
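For illustration, a commented-out block is simply skipped without deleting its configuration; the second dataset entry below is hypothetical:

```yaml
DATASETS:
  bank_churn:
    path: ./data/bank_churn
    X: "X_bank_churn.csv"
    y: "y_bank_churn.csv"
  # toy_dataset:                 # commented out: this dataset is skipped
  #   path: ./data/toy_dataset
```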
For quick sanity checks, enable test mode in the configuration:

```yaml
TEST_MODE: True
TEST_SAMPLES: 200
```

In this mode:

- Only `TEST_LLM_KEYS` are used
- Only `TEST_EXPERIMENTS` are executed
- The number of samples is reduced
```bash
python src/main.py --config config/config.yaml
```

If you encounter a module resolution error, use:

```bash
python -m src.main --config config/config.yaml
```

- Additional LLMs can be integrated for embedding generation
- New downstream models can be added
- New fusion strategies can be implemented
Detailed extension guidelines will be added in future versions.
This project is licensed under the MIT License. See the LICENSE file for details.
Oksana Kolomenko, Ricardo Knauer, Erik Rodner