Tab-Embeddings: Learning on Tabular Data with LLM-Based Embeddings

Overview

This repository provides an experimental framework for training and evaluating machine learning models on tabular data augmented with LLM-based embeddings. The project systematically studies how different types of embeddings—text embeddings, Random Tree Embeddings (RTE), and their combinations with structured features—affect downstream performance on tabular classification tasks.

The focus is on modular experimentation: datasets, feature types, embedding models, concatenation strategies, and downstream classifiers can be flexibly combined through configuration files.
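The core idea can be sketched in a few lines: text embeddings are concatenated with structured tabular features before a downstream classifier is fit. The sketch below uses random stand-in vectors in place of real LLM embeddings and synthetic data; all names are illustrative, not the framework's actual API.

```python
# Illustrative sketch: fuse (stand-in) text embeddings with tabular
# features by horizontal concatenation, then fit a downstream classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_tabular, embed_dim = 100, 5, 16

X_tabular = rng.normal(size=(n_samples, n_tabular))  # structured features
X_embed = rng.normal(size=(n_samples, embed_dim))    # stand-in for LLM embeddings
y = (X_tabular[:, 0] > 0).astype(int)                # synthetic binary target

# Feature fusion: simple column-wise concatenation
X_fused = np.hstack([X_tabular, X_embed])

clf = LogisticRegression(max_iter=1000).fit(X_fused, y)
print(X_fused.shape)  # (100, 21)
```

In the actual framework, the embedding source, fusion strategy, and classifier are all selected through the configuration file rather than hard-coded.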


Implemented Experiments

The current experimental setup includes:

  • Downstream models:

    • Logistic Regression (LR)
    • HistGradientBoostingClassifier (HGBDT)
  • Embedding sources:

    • Text embeddings generated by 16 large language models
    • Random Tree Embeddings (RTE)
  • Feature fusion / concatenation strategies:

    • RTE embeddings + full tabular features
    • Text embeddings + full tabular features
    • Text embeddings + numerical (metrical) features
    • Nominal text embeddings + numerical tabular features

These components can be selectively enabled or disabled via the configuration file.


Installation

  1. Clone the repository:

    git clone https://github.com/ml-lab-htw/tab-embeddings.git
    cd tab-embeddings
  2. (Optional but recommended) Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate   # macOS / Linux
    venv\Scripts\activate      # Windows
  3. Install dependencies:

    pip install -r requirements.txt

Data Preprocessing

The framework supports both provided example datasets and custom user-defined datasets. To use a new dataset, it must be registered and preprocessed as described below.

1. Add the Dataset

Create a new directory under data/ named after your dataset:

data/
└── <dataset>/
    ├── X_<dataset>.csv
    └── y_<dataset>.csv
  • X_<dataset>.csv: feature matrix
  • y_<dataset>.csv: target labels
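For example, the two CSVs can be produced from a single raw table with pandas. The sketch below uses a made-up dataset called "demo" with invented column names; it only illustrates the expected file layout.

```python
# Hypothetical sketch: split a raw table into the X/y CSV pair expected
# under data/<dataset>/, for an illustrative dataset named "demo".
from pathlib import Path
import pandas as pd

df = pd.DataFrame({
    "age": [34, 45, 29],
    "job": ["admin", "technician", "services"],
    "churn": [0, 1, 0],  # target column
})

out = Path("data/demo")
out.mkdir(parents=True, exist_ok=True)
df.drop(columns="churn").to_csv(out / "X_demo.csv", index=False)  # feature matrix
df["churn"].to_csv(out / "y_demo.csv", index=False)               # target labels
```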

2. Register the Dataset in the Configuration

Update config/config.yaml by adding entries under both DATASETS and FEATURES.

Example:

DATASETS:
  &lt;dataset&gt;:
    path: ./data/<dataset>
    X: "X_<dataset>.csv"
    y: "y_<dataset>.csv"
    X_metr: "X_<dataset>_metrics.csv"
    X_nom: "X_<dataset>_nom.csv"
    summaries: "<dataset>_summaries.txt"
    nom_summaries: "<dataset>_nom_summaries.txt"
    pca_components: 50
    n_splits: 5
    n_repeats: 1

FEATURES:
  &lt;dataset&gt;:
    nominal_features: [
      'nom_feat_1',
      'nom_feat_2'
    ]
    text_features: ["text"]

3. Generate Derived Data Files

For each dataset, the following derived files must be created:

  • X_<dataset>_metrics.csv – numerical (metrical) features only
  • X_<dataset>_nom.csv – nominal (categorical) features only
  • <dataset>_summaries.txt – text summaries generated from the full dataset
  • <dataset>_nom_summaries.txt – summaries generated from nominal features only

These files are required for the embedding and fusion experiments.
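Conceptually, the first two derived files are just a dtype-based split of the feature matrix, and the summary files are row-wise textual renderings of it. The sketch below illustrates this with made-up columns; the actual files are produced by the `split` and `summaries` commands described in the next step.

```python
# Conceptual sketch of the derived files' contents (column names invented).
import pandas as pd

X = pd.DataFrame({
    "balance": [1200.5, 300.0],
    "tenure": [5, 2],
    "job": ["admin", "technician"],
})

X_metr = X.select_dtypes(include="number")   # -> X_<dataset>_metrics.csv
X_nom = X.select_dtypes(exclude="number")    # -> X_<dataset>_nom.csv

# A row-wise text summary, roughly "column is value" joined per row
summaries = X.apply(
    lambda r: ", ".join(f"{c} is {r[c]}" for c in X.columns), axis=1
)
print(summaries[0])
```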


4. Run Preprocessing Commands

From the project root:

cd tab-embeddings
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
  1. Split numerical and nominal features:

    python -m src.main --config config/config.yaml split --dataset <dataset>
  2. Generate summaries from the full dataset:

    python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope full
  3. Generate summaries from nominal features only:

    python -m src.main --config config/config.yaml summaries --dataset <dataset> --scope nominal

After these steps, the dataset is fully prepared for the experiment pipeline.


Usage

Configuration

Experiments are controlled via a YAML configuration file. You may either:

  • Modify config/config.yaml, or
  • Create a custom configuration file that follows the same structure as config/config.yaml.

You can selectively disable:

  • specific datasets
  • embedding models
  • downstream classifiers
  • experiment types

by commenting them out in the configuration file.
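For instance, a dataset or a downstream model is excluded simply by commenting out its entry. The fragment below is hypothetical; the key names (beyond DATASETS) are illustrative, not necessarily the ones in config/config.yaml.

```yaml
DATASETS:
  bank_churn:
    path: ./data/bank_churn
  # another_dataset:        # commented out -> dataset is skipped
  #   path: ./data/another_dataset

DOWNSTREAM_MODELS:          # illustrative key name
  - LR
  # - HGBDT                 # commented out -> classifier is skipped
```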


Test Mode

For quick sanity checks, enable test mode in the configuration:

TEST_MODE: True
TEST_SAMPLES: 200

In this mode:

  • Only the embedding models listed under TEST_LLM_KEYS are used
  • Only the experiments listed under TEST_EXPERIMENTS are executed
  • The number of samples is reduced to TEST_SAMPLES

Running the Experiments

python src/main.py --config config/config.yaml

If you encounter a module resolution error, use:

python -m src.main --config config/config.yaml

Extending the Framework

  • Additional LLMs can be integrated for embedding generation
  • New downstream models can be added
  • New fusion strategies can be implemented

Detailed extension guidelines will be added in future versions.


License

This project is licensed under the MIT License. See the LICENSE file for details.


Authors

Oksana Kolomenko, Ricardo Knauer, and Erik Rodner
