GTM is a neural topic modeling framework for large multimodal/multilingual corpora.
It supports:
- Exploratory analysis of latent topics
- Supervised prediction using topic features
- Causal modeling via structured priors and metadata
- Multilingual and multimodal corpora
- Flexible metadata handling:
  - `prevalence`: influences topic choice
  - `content`: alters topic content conditioned on topic
  - `labels`: for classification or regression tasks
  - `prediction`: additional predictors for labels
- Input representations:
  - Document embeddings
  - Word frequency (BoW)
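
To make the word-frequency (BoW) representation concrete, here is a minimal standard-library sketch. The helper `bag_of_words` is hypothetical and written only for illustration; it is not part of GTM, which builds this view from a vectorizer you supply.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; not part of GTM.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


docs = ["taxes taxes budget", "budget health care"]
vocab, counts = bag_of_words(docs)
print(vocab)   # ['budget', 'care', 'health', 'taxes']
print(counts)  # [[1, 0, 0, 2], [1, 1, 1, 0]]
```

Each row is one document and each column one vocabulary term. A document embedding replaces this sparse count row with a dense vector.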
Prepares your corpus and metadata.
- Metadata (optional): `prevalence`, `content`, `labels`, `prediction`
- Multimodal views:
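from sklearn.feature_extraction.text import CountVectorizer

# my_image_embedder is assumed to be a user-defined function that embeds images
modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                "type": "embedding",
                "embed_fn": my_image_embedder
            }
        }
    }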
}

model = GTM(...)

- `n_topics`: number of latent topics
- `doc_topic_prior`:
  - `"dirichlet"` (sparse, interpretable)
  - `"logistic_normal"` (flexible, use with `"vae"`)
- `w_prior`: how much metadata influences topics
- `w_pred_loss`: weight of supervised loss (if using `labels`)
- Set `update_prior=True` to condition topic priors on `prevalence` covariates
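
The difference between the two document-topic priors can be illustrated by sampling topic proportions from each. This is a standard-library sketch of the general distributions, not GTM's internal code. The helper names and the values `alpha=0.1` and `sigma=1.0` are assumptions chosen for illustration.

```python
import math
import random


def sample_dirichlet(n_topics, alpha, rng):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_logistic_normal(n_topics, sigma, rng):
    """Logistic-normal draw: softmax of independent Gaussian variates."""
    z = [rng.gauss(0.0, sigma) for _ in range(n_topics)]
    m = max(z)  # subtract the max for numerical stability
    exp_z = [math.exp(v - m) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]


rng = random.Random(0)
theta_dir = sample_dirichlet(n_topics=5, alpha=0.1, rng=rng)
theta_ln = sample_logistic_normal(n_topics=5, sigma=1.0, rng=rng)
print([round(p, 3) for p in theta_dir])
print([round(p, 3) for p in theta_ln])
```

With a small concentration parameter, Dirichlet draws tend to put most of the mass on one or two topics, which is why that prior is described as sparse and interpretable. Logistic-normal draws are typically smoother.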
- `"wae"`: Wasserstein Autoencoder (default, stable)
- `"vae"`: Variational Autoencoder
  - Use with `doc_topic_prior="logistic_normal"`
Prevents posterior collapse and encourages meaningful topics:
kl_annealing_start = 0
kl_annealing_end = 1000
kl_annealing_max_beta = 1.0

- `get_topic_words()`: top words per topic
- `get_covariate_words()`: word shifts by `content` covariates
- `get_top_docs()`: representative docs per topic
- `estimate_effect()`: topic prevalence regression (linear)
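
As a rough picture of what a linear topic prevalence regression measures, the sketch below fits ordinary least squares of one topic's share on a single covariate. It does not call `estimate_effect()`, whose arguments are not documented here. The helper `ols_slope_intercept` and the data are made up for illustration.

```python
def ols_slope_intercept(x, y):
    """Closed-form simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: a binary prevalence covariate and the share of one topic
# in each of six documents.
covariate = [0, 0, 0, 1, 1, 1]
topic_share = [0.10, 0.12, 0.14, 0.30, 0.32, 0.34]

slope, intercept = ols_slope_intercept(covariate, topic_share)
print(round(slope, 3), round(intercept, 3))  # 0.2 0.12
```

Here the slope says that documents with the covariate switched on devote about 0.2 more of their content to the topic, on average, than documents without it.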
- `plot_topic_word_distribution()`: word clouds / bar plots
- `visualize_docs()`: 2D projection (UMAP, t-SNE, PCA)
- `visualize_words()`: semantic word embeddings
- `visualize_topics()`: semantic topic embeddings
- `get_predictions()`: returns classification or regression outputs (if `labels` were used)
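
Returning to the KL annealing settings listed earlier (`kl_annealing_start`, `kl_annealing_end`, `kl_annealing_max_beta`), the sketch below shows one common way such a schedule works. It assumes a linear ramp and that the start and end values count training steps. Whether GTM uses exactly this shape or unit is not stated here, and `kl_beta` is a hypothetical helper.

```python
def kl_beta(step, start=0, end=1000, max_beta=1.0):
    """Linear KL weight: 0 up to `start`, then a ramp reaching `max_beta` at `end`.

    Assumed schedule for illustration; not taken from GTM's source.
    """
    if step <= start:
        return 0.0
    if step >= end:
        return max_beta
    return max_beta * (step - start) / (end - start)


for step in (0, 250, 500, 1000, 2000):
    print(step, kl_beta(step))
# 0 0.0
# 250 0.25
# 500 0.5
# 1000 1.0
# 2000 1.0
```

In a VAE-style objective the weight multiplies the KL term, as in `loss = reconstruction + kl_beta(step) * kl`. Early training therefore focuses on reconstruction, and the prior is enforced gradually.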
Get started with example notebooks in notebooks/.
The dataset used in these notebooks can be downloaded here and should be placed in the data folder.
- Deep Latent Variable Models for Unstructured Data (PDF). Germain Gauthier, Philine Widmer, and Elliott Ash
- generalized_topic_models: A Python Package to Estimate Neural Topic Models (PDF). Germain Gauthier, Philine Widmer, and Elliott Ash
This package is under active development 🚧. Feedback and contributions are welcome!