PinchOfData/generalized_topic_model

Generalized Topic Model (GTM)

GTM is a neural topic modeling framework for large multimodal/multilingual corpora.

It supports:

  • Exploratory analysis of latent topics
  • Supervised prediction using topic features
  • Causal modeling via structured priors and metadata

Key Features

  • Multilingual and multimodal support
  • Flexible metadata handling:
    • prevalence: influences topic choice
    • content: shifts a topic's word distribution, conditional on the topic
    • labels: for classification or regression tasks
    • prediction: additional predictors for labels
  • Input representations:
    • Document embeddings
    • Word frequency (BoW)
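For intuition, here is a minimal stand-alone sketch of the bag-of-words representation (a hypothetical helper, not part of the GTM API; in practice the corpus builder delegates this to a vectorizer such as scikit-learn's CountVectorizer):

```python
from collections import Counter

def bow_vectors(docs):
    # Build a shared, sorted vocabulary, then count word occurrences per document.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, X = bow_vectors(["the cat sat", "the dog sat down"])
# vocab -> ['cat', 'dog', 'down', 'sat', 'the']
```

Document embeddings play the same role but replace sparse counts with dense vectors from a pretrained encoder.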

🚀 Getting Started

1. Build the Dataset with GTMCorpus()

Prepares your corpus and metadata.

✅ Supported Inputs:

  • Metadata (optional):
    • prevalence, content, labels, prediction
  • Multimodal views:
```python
from sklearn.feature_extraction.text import CountVectorizer

modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                # my_image_embedder: user-supplied function mapping an image to an embedding
                "embed_fn": my_image_embedder,
                "type": "embedding"
            }
        }
    }
}
```

2. Train the GTM Model

```python
model = GTM(...)
```

🔧 Core Options:

  • n_topics: number of latent topics
  • doc_topic_prior:
    • "dirichlet" (sparse, interpretable)
    • "logistic_normal" (flexible, use with vae)
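The two priors behave quite differently. A quick stdlib sketch (illustration only, not GTM's sampling code) contrasts a sparse Dirichlet draw with a softmax-of-Gaussians draw, which is what "logistic_normal" means:

```python
import math
import random

random.seed(0)

def sample_dirichlet(alpha, k):
    # Symmetric Dirichlet via normalized Gamma draws; small alpha -> sparse mixtures.
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_logistic_normal(k, sigma=1.0):
    # Softmax of independent Gaussians: the logistic-normal distribution.
    z = [random.gauss(0.0, sigma) for _ in range(k)]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

theta_sparse = sample_dirichlet(0.1, 5)   # mass concentrates on few topics
theta_smooth = sample_logistic_normal(5)  # mass spread more evenly
```

Both return a valid topic mixture (non-negative, sums to 1); the Dirichlet with small alpha pushes most mass onto a few topics, which is why it reads as more interpretable.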

⚖️ Loss Weights:

  • w_prior: how much metadata influences topics
  • w_pred_loss: weight of supervised loss (if using labels)
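A plausible reading of these weights is the usual weighted-sum objective (a sketch under that assumption; GTM's exact loss terms may differ):

```python
def total_loss(recon, prior_term, pred_term, w_prior=1.0, w_pred_loss=1.0):
    # Weighted sum: reconstruction + metadata-prior term + supervised prediction term.
    return recon + w_prior * prior_term + w_pred_loss * pred_term
```

Setting w_pred_loss=0.0 recovers a purely unsupervised model even when labels are present.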

📐 Structured Priors:

  • Set update_prior=True to condition topic priors on prevalence covariates

🧬 Autoencoder Type:

  • "wae": Wasserstein Autoencoder (default, stable)
  • "vae": Variational Autoencoder
    • Use with doc_topic_prior="logistic_normal"

🔁 KL Annealing (VAE only):

Prevents posterior collapse and encourages meaningful topics:

```python
kl_annealing_start = 0
kl_annealing_end = 1000
kl_annealing_max_beta = 1.0
```
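These settings describe a warm-up of the KL weight beta from 0 to max_beta over training steps. A sketch of the standard linear schedule (the exact shape GTM uses may differ):

```python
def kl_beta(step, start=0, end=1000, max_beta=1.0):
    # Linearly ramp the KL weight from 0 at `start` to `max_beta` at `end`.
    if step <= start:
        return 0.0
    if step >= end:
        return max_beta
    return max_beta * (step - start) / (end - start)
```

Early in training the reconstruction term dominates (beta near 0), so the encoder learns to use the latent topics before the KL term starts pulling the posterior toward the prior.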

3. Explore and Analyze Topics

📝 Topic Inspection:

  • get_topic_words() — top words per topic
  • get_covariate_words() — word shifts by content covariates
  • get_top_docs() — representative docs per topic
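Conceptually, get_topic_words() ranks the vocabulary by each topic's word probabilities. A minimal stand-alone sketch with made-up numbers (not the GTM internals):

```python
def top_words(topic_word_probs, vocab, k=3):
    # Sort vocabulary indices by descending probability and keep the top k.
    ranked = sorted(range(len(vocab)),
                    key=lambda i: topic_word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

vocab = ["tax", "vote", "goal", "match", "budget"]
econ_topic = [0.40, 0.05, 0.02, 0.03, 0.50]  # hypothetical topic-word distribution
print(top_words(econ_topic, vocab))  # ['budget', 'tax', 'vote']
```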

📈 Metadata Effects:

  • estimate_effect() — topic prevalence regression (linear)

🖼️ Visualizations:

  • plot_topic_word_distribution() — word clouds / bar plots
  • visualize_docs() — 2D projection (UMAP, t-SNE, PCA)
  • visualize_words() — semantic word embeddings
  • visualize_topics() — semantic topic embeddings

🎯 Supervised Prediction:

  • get_predictions() — returns classification or regression outputs (if labels were used)

📚 Tutorials

Get started with the example notebooks in `notebooks/`.

The dataset used in these notebooks can be downloaded here and should be placed in the `data/` folder.


📖 References


⚠️ Disclaimer

This package is under active development 🚧 — feedback and contributions are welcome!

About

A torch implementation of the Generalized Topic Model.
