PinchOfData/generalized_topic_model

Generalized Topic Model (GTM)

GTM is a neural topic modeling framework for large multimodal/multilingual corpora.

It supports:

  • Exploratory analysis of latent topics
  • Supervised prediction using topic features
  • Causal modeling via structured priors and metadata

Key Features

  • Multilingual and multimodal support
  • Flexible metadata handling:
    • prevalence: influences topic choice
    • content: shifts a topic's word distribution, conditional on the topic
    • labels: for classification or regression tasks
    • prediction: additional predictors for labels
  • Input representations:
    • Document embeddings
    • Word frequency (BoW)
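For intuition, here is a minimal stand-alone sketch of the bag-of-words representation (a hypothetical helper, not part of the GTM API; in practice the corpus builder delegates this to a vectorizer such as scikit-learn's CountVectorizer):

```python
from collections import Counter

def bow_vectors(docs):
    # Build a shared, sorted vocabulary, then count word occurrences per document.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, X = bow_vectors(["the cat sat", "the dog sat down"])
# vocab -> ['cat', 'dog', 'down', 'sat', 'the']
```

Document embeddings play the same role but replace sparse counts with dense vectors from a pretrained encoder.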

🚀 Getting Started

1. Build the Dataset with GTMCorpus()

Prepares your corpus and metadata.

✅ Supported Inputs:

  • Metadata (optional):
    • prevalence, content, labels, prediction
  • Multimodal views:
```python
from sklearn.feature_extraction.text import CountVectorizer

modalities = {
    "text": {
        "column": "doc_clean",
        "views": {
            "bow": {
                "type": "bow",
                "vectorizer": CountVectorizer()
            }
        }
    },
    "image": {
        "column": "image_path",
        "views": {
            "embedding": {
                # my_image_embedder: user-supplied function mapping an image to an embedding
                "embed_fn": my_image_embedder,
                "type": "embedding"
            }
        }
    }
}
```

2. Train the GTM Model

```python
model = GTM(...)
```

🔧 Core Options:

  • n_topics: number of latent topics
  • doc_topic_prior:
    • "dirichlet" (sparse, interpretable)
    • "logistic_normal" (flexible, use with vae)
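The two priors behave quite differently. A quick stdlib sketch (illustration only, not GTM's sampling code) contrasts a sparse Dirichlet draw with a softmax-of-Gaussians draw, which is what "logistic_normal" means:

```python
import math
import random

random.seed(0)

def sample_dirichlet(alpha, k):
    # Symmetric Dirichlet via normalized Gamma draws; small alpha -> sparse mixtures.
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_logistic_normal(k, sigma=1.0):
    # Softmax of independent Gaussians: the logistic-normal distribution.
    z = [random.gauss(0.0, sigma) for _ in range(k)]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

theta_sparse = sample_dirichlet(0.1, 5)   # mass concentrates on few topics
theta_smooth = sample_logistic_normal(5)  # mass spread more evenly
```

Both return a valid topic mixture (non-negative, sums to 1); the Dirichlet with small alpha pushes most mass onto a few topics, which is why it reads as more interpretable.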

⚖️ Loss Weights:

  • w_prior: how much metadata influences topics
  • w_pred_loss: weight of supervised loss (if using labels)
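A plausible reading of these weights is the usual weighted-sum objective (a sketch under that assumption; GTM's exact loss terms may differ):

```python
def total_loss(recon, prior_term, pred_term, w_prior=1.0, w_pred_loss=1.0):
    # Weighted sum: reconstruction + metadata-prior term + supervised prediction term.
    return recon + w_prior * prior_term + w_pred_loss * pred_term
```

Setting w_pred_loss=0.0 recovers a purely unsupervised model even when labels are present.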

📐 Structured Priors:

  • Set update_prior=True to condition topic priors on prevalence covariates

🧬 Autoencoder Type:

  • "wae": Wasserstein Autoencoder (default, stable)
  • "vae": Variational Autoencoder
    • Use with doc_topic_prior="logistic_normal"

🔁 KL Annealing (VAE only):

Prevents posterior collapse and encourages meaningful topics:

```python
kl_annealing_start = 0
kl_annealing_end = 1000
kl_annealing_max_beta = 1.0
```
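These settings describe a warm-up of the KL weight beta from 0 to max_beta over training steps. A sketch of the standard linear schedule (the exact shape GTM uses may differ):

```python
def kl_beta(step, start=0, end=1000, max_beta=1.0):
    # Linearly ramp the KL weight from 0 at `start` to `max_beta` at `end`.
    if step <= start:
        return 0.0
    if step >= end:
        return max_beta
    return max_beta * (step - start) / (end - start)
```

Early in training the reconstruction term dominates (beta near 0), so the encoder learns to use the latent topics before the KL term starts pulling the posterior toward the prior.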

3. Explore and Analyze Topics

📝 Topic Inspection:

  • get_topic_words() — top words per topic
  • get_covariate_words() — word shifts by content covariates
  • get_top_docs() — representative docs per topic
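Conceptually, get_topic_words() ranks the vocabulary by each topic's word probabilities. A minimal stand-alone sketch with made-up numbers (not the GTM internals):

```python
def top_words(topic_word_probs, vocab, k=3):
    # Sort vocabulary indices by descending probability and keep the top k.
    ranked = sorted(range(len(vocab)),
                    key=lambda i: topic_word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

vocab = ["tax", "vote", "goal", "match", "budget"]
econ_topic = [0.40, 0.05, 0.02, 0.03, 0.50]  # hypothetical topic-word distribution
print(top_words(econ_topic, vocab))  # ['budget', 'tax', 'vote']
```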

📈 Metadata Effects:

  • estimate_effect() — topic prevalence regression (linear)

🖼️ Visualizations:

  • plot_topic_word_distribution() — word clouds / bar plots
  • visualize_docs() — 2D projection (UMAP, t-SNE, PCA)
  • visualize_words() — semantic word embeddings
  • visualize_topics() — semantic topic embeddings

🎯 Supervised Prediction:

  • get_predictions() — returns classification or regression outputs (if labels were used)

📚 Tutorials

Get started with the example notebooks in `notebooks/`.

The dataset used in these notebooks can be downloaded here and should be placed in the `data/` folder.


📖 References


⚠️ Disclaimer

This package is under active development 🚧 — feedback and contributions are welcome!

About

A torch implementation of the Generalized Topic Model.
