This tool is designed to process Persian text content (e.g., Telegram messages) and generate a word cloud from the entire dataset. It uses the BERTopic model along with UMAP and the ParsBERT language model for topic modeling and extracting common n-grams.

Persian-NLP-Toolkit/BERT-topic-modelling

Persian WordCloud Generation with BERTopic and ParsBERT


Prerequisites

Before running the project, make sure Python 3.7 or higher is installed on your system.

Install dependencies

pip install -r requirements.txt

Recommended requirements.txt content:

pandas
bertopic
umap-learn
torch
transformers
wordcloud
matplotlib
nltk

You also need a Persian font file, such as Vazir-Bold.ttf, placed either in the project folder or at the font path specified in the code.


File Structure

.
├── main.py                    # Entry point of the program
├── generate_wordcloud.py      # Main code for wordcloud generation
├── embedder.py                # ParsBERTEmbedder class for vectorization
├── preprocess.py              # Text preprocessing functions
├── utils.py                   # n-gram and PMI weighting calculations
├── wordclouds/                # Output folder for images
└── README.md                  # This guide file
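
As an illustration of the kind of work preprocess.py might do, here is a minimal sketch of Persian text normalization. The function name `normalize` and the specific cleaning steps (mapping Arabic ye/kaf to their Persian forms, dropping URLs, @mentions, and diacritics) are assumptions chosen for Telegram-style text; the actual functions in the repository may differ.

```python
import re

# Map Arabic code points to their Persian counterparts:
# Arabic ye (U+064A) -> Persian ye (U+06CC), Arabic kaf (U+0643) -> Persian kaf (U+06A9).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # Telegram-style @mentions
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")  # Arabic-script short-vowel marks
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text):
    """Hypothetical cleaner: unify characters, drop links/mentions/diacritics."""
    text = text.translate(CHAR_MAP)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = DIACRITICS_RE.sub("", text)
    # Collapse runs of whitespace; the zero-width non-joiner is not whitespace
    # in Python, so it is preserved.
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = normalize("\u0643\u062A\u0627\u0628  \u0639\u0644\u064A https://t.me/x @user")
print(cleaned)
```

Unifying the two spellings of ye and kaf before counting words keeps visually identical words from being split into separate entries.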

Expected Input

A CSV file with at least one column named:

  • txtContent — containing Persian text content

Example:

txtContent
This is a test text for analysis.
Various topics are seen in messages.
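
The input requirement above can be checked before running the full pipeline. The sketch below uses only the standard library csv module so it runs without the project's dependencies; the helper name `load_texts` is hypothetical, and the project itself may load the file differently (for example with pandas).

```python
import csv
import io

REQUIRED_COLUMN = "txtContent"  # column name taken from this README


def load_texts(csv_file):
    """Return the non-empty txtContent values from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or REQUIRED_COLUMN not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{REQUIRED_COLUMN}' column")
    texts = []
    for row in reader:
        value = (row[REQUIRED_COLUMN] or "").strip()
        if value:  # skip rows whose text cell is empty
            texts.append(value)
    return texts


# In-memory stand-in for a real file, using the example rows shown above.
sample = io.StringIO(
    "txtContent\n"
    "This is a test text for analysis.\n"
    "Various topics are seen in messages.\n"
)
texts = load_texts(sample)
print(len(texts))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.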

Run the Program

python main.py /path/to/file.csv

After execution, a word cloud image will be saved at wordclouds/wordcloud_all_data.png.
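
A rough sketch of how an entry point could handle this command line is shown below. The function names `parse_args` and `output_path` are hypothetical, and the real main.py may be organized differently; only the positional CSV argument and the output location come from this README.

```python
import argparse
from pathlib import Path

OUTPUT_DIR = Path("wordclouds")         # output folder named in this README
OUTPUT_NAME = "wordcloud_all_data.png"  # output file name named in this README


def parse_args(argv=None):
    """Parse the single positional argument: the path to the input CSV."""
    parser = argparse.ArgumentParser(
        description="Generate a Persian word cloud from a CSV file."
    )
    parser.add_argument("csv_path", type=Path, help="CSV file with a txtContent column")
    return parser.parse_args(argv)


def output_path():
    """Location where the word cloud image is expected to be saved."""
    return OUTPUT_DIR / OUTPUT_NAME


# Explicit argv for illustration; a real script would call parse_args() with no arguments.
args = parse_args(["/path/to/file.csv"])
print(args.csv_path.as_posix(), "->", output_path().as_posix())
```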


Notes

  • ParsBERT is used to embed the text.
  • Frequent bigrams (2-grams) are extracted and weighted by PMI (pointwise mutual information) and topic distribution.
  • The word cloud renders Persian text, so it needs a font that includes Persian glyphs.
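
To make the PMI weighting concrete, here is a minimal sketch of bigram PMI over tokenized documents, using the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `bigram_pmi` and the `min_count` threshold are assumptions for illustration; the topic-distribution weighting mentioned above is omitted, and the code in utils.py may compute these scores differently.

```python
import math
from collections import Counter


def bigram_pmi(tokenized_docs, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in tokenized_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs within one document

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:  # rare pairs give unreliable PMI estimates
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


docs = [
    ["word", "cloud", "tool"],
    ["word", "cloud", "image"],
    ["topic", "model", "tool"],
]
scores = bigram_pmi(docs)
print(scores)
```

A high score means the two words appear together more often than their individual frequencies would predict, which is why PMI favors fixed expressions over pairs of merely common words.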

Example Output

The image below is an example of the output generated by the program:

[Image: example word cloud output]
