Persian WordCloud Generation with BERTopic and ParsBERT

This tool is designed to process Persian text content (e.g., Telegram messages) and generate a word cloud from the entire dataset. It uses the BERTopic model along with UMAP and the ParsBERT language model for topic modeling and extracting common n-grams.

Prerequisites

Before running the project, make sure Python 3.7 or higher is installed on your system.

Install dependencies

pip install -r requirements.txt

Recommended requirements.txt content:

pandas
bertopic
umap-learn
torch
transformers
wordcloud
matplotlib
nltk

You also need a Persian font, such as Vazir-Bold.ttf, placed in the project folder or at the path specified in the code.

File Structure

.
├── main.py                    # Entry point of the program
├── generate_wordcloud.py      # Main code for wordcloud generation
├── embedder.py                # ParsBERTEmbedder class for vectorization
├── preprocess.py              # Text preprocessing functions
├── utils.py                   # n-gram and PMI weighting calculations
├── wordclouds/                # Output folder for images
└── README.md                  # This guide file
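
The contents of preprocess.py are not shown in this guide. As a rough illustration of the kind of cleaning that Persian Telegram text usually needs before embedding, here is a minimal sketch. The function name normalize_persian, the regular expressions, and the character mapping are all assumptions made for this example, not the project's actual code.

```python
import re

# Assumed mapping: Arabic yeh and kaf replaced by their Persian forms.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Anything outside the Arabic Unicode block (U+0600-U+06FF), the zero-width
# non-joiner (U+200C), and whitespace is replaced by a space.
NON_PERSIAN_RE = re.compile(r"[^\u0600-\u06FF\u200c\s]")


def normalize_persian(text):
    """Hypothetical cleaner: strip URLs, mentions, and non-Persian characters."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    text = NON_PERSIAN_RE.sub(" ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())
```

Note that the Arabic block also contains Persian digits and punctuation, so this sketch keeps them. A real pipeline may want stricter filtering and stop-word removal.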

Expected Input

A CSV file with at least one column named txtContent, which contains the Persian text content.

Example:

txtContent
This is a test text for analysis.
Various topics are seen in messages.
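
To make the input contract concrete, the following sketch reads a CSV and checks for the txtContent column before any modeling starts. It uses the standard csv module rather than pandas so that it runs without dependencies. The helper name read_texts is an assumption for illustration and may not match how the project loads data.

```python
import csv
import io

REQUIRED_COLUMN = "txtContent"


def read_texts(csv_file):
    """Return the non-empty txtContent values from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or REQUIRED_COLUMN not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{REQUIRED_COLUMN}' column")
    texts = []
    for row in reader:
        value = (row.get(REQUIRED_COLUMN) or "").strip()
        if value:  # skip rows with empty text
            texts.append(value)
    return texts


# In-memory stand-in for a real file, using the example rows from this guide.
sample = io.StringIO(
    "txtContent\n"
    "This is a test text for analysis.\n"
    "Various topics are seen in messages.\n"
)
texts = read_texts(sample)
```

For a file on disk, open it with encoding="utf-8" and newline="" before passing it to read_texts, so Persian characters and embedded line breaks are read correctly.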

Run the Program

python main.py /path/to/file.csv

After execution, a word cloud image will be saved at wordclouds/wordcloud_all_data.png.
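
The command above implies an entry point that takes a single positional CSV path and writes to a fixed output location. A minimal sketch of that interface, assuming argparse is used (the actual main.py may parse its arguments differently, and the names parse_args and output_path are hypothetical):

```python
import argparse
from pathlib import Path

# Output location taken from this guide.
OUTPUT_DIR = Path("wordclouds")
OUTPUT_NAME = "wordcloud_all_data.png"


def parse_args(argv=None):
    """Parse the single positional argument: the path to the input CSV."""
    parser = argparse.ArgumentParser(
        description="Generate a Persian word cloud from a CSV file."
    )
    parser.add_argument("csv_path", type=Path, help="CSV file with a txtContent column")
    return parser.parse_args(argv)


def output_path():
    """Return the path where the word cloud image is expected to be saved."""
    return OUTPUT_DIR / OUTPUT_NAME


args = parse_args(["/path/to/file.csv"])
```

Passing argv explicitly, as in the last line, keeps the parser easy to exercise from tests. Calling parse_args() with no arguments reads from the command line as usual.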

Notes

ParsBERT is used to embed the texts.
Frequent bigrams (2-grams) are extracted and weighted by PMI and topic distribution.
The word cloud is rendered in Persian, so a font with Persian glyph coverage is required.
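
The PMI weighting mentioned above can be sketched as follows. This is the textbook pointwise mutual information for adjacent word pairs, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), computed from raw counts with whitespace tokenization. The function name bigram_pmi and the min_count parameter are assumptions for illustration. The way utils.py combines PMI with the topic distribution is not described in this guide, so that step is not shown.

```python
import math
from collections import Counter


def bigram_pmi(documents, min_count=1):
    """Score each adjacent word pair by pointwise mutual information."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


scores = bigram_pmi(["a b", "a b", "c d"])
```

PMI favors pairs whose words rarely appear apart, so very rare bigrams can receive high scores. Raising min_count is a common way to temper that effect.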

Example Output

The image below is an example of the output generated by the program:
