This tool is designed to process Persian text content (e.g., Telegram messages) and generate a word cloud from the entire dataset. It uses the BERTopic model along with UMAP and the ParsBERT language model for topic modeling and extracting common n-grams.
Before running the project, make sure Python 3.7 or higher is installed on your system.
Then install the dependencies:
pip install -r requirements.txt

Recommended requirements.txt content:
pandas
bertopic
umap-learn
torch
transformers
wordcloud
matplotlib
nltk
You also need a Persian font such as Vazir-Bold.ttf, placed in the project folder or at the path specified in the code.
.
├── main.py # Entry point of the program
├── generate_wordcloud.py # Main code for wordcloud generation
├── embedder.py # ParsBERTEmbedder class for vectorization
├── preprocess.py # Text preprocessing functions
├── utils.py # n-gram and PMI weighting calculations
├── wordclouds/ # Output folder for images
└── README.md # This guide file
The input is a CSV file with at least one column named txtContent, which contains the Persian text content.
Example:
| txtContent |
|---|
| This is a test text for analysis. |
| Various topics are seen in messages. |
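Before running the full pipeline, it can help to check that the input file has the expected column. The following is a minimal sketch using only the standard library; the helper name `read_texts` and the sample rows are hypothetical and not part of the project, which may well load the file with pandas instead.

```python
import csv
import io

REQUIRED_COLUMN = "txtContent"  # column name taken from this README


def read_texts(csv_file):
    """Return the non-empty txtContent values from an open CSV file object.

    Hypothetical helper for illustration; not the project's own loader.
    """
    reader = csv.DictReader(csv_file)
    if reader.fieldnames is None or REQUIRED_COLUMN not in reader.fieldnames:
        raise ValueError(f"CSV must contain a '{REQUIRED_COLUMN}' column")
    texts = []
    for row in reader:
        value = (row.get(REQUIRED_COLUMN) or "").strip()
        if value:  # skip rows whose text cell is empty
            texts.append(value)
    return texts


# In-memory sample standing in for a real file opened with
# open(path, newline="", encoding="utf-8").
sample = io.StringIO(
    "id,txtContent\n"
    "1,این یک متن آزمایشی است\n"
    "2,\n"
    "3,موضوعات مختلف\n"
)
texts = read_texts(sample)
print(len(texts), "usable rows")
```

Failing early with a clear error is usually cheaper than discovering a missing column after the embedding model has been loaded.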
python main.py /path/to/file.csv

After execution, a word cloud image will be saved at wordclouds/wordcloud_all_data.png.
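An entry point that accepts the CSV path as its only argument and derives the output location could look roughly like the sketch below. The function names `parse_args` and `output_path` are assumptions made for illustration; the actual main.py may be organized differently.

```python
import argparse
from pathlib import Path

OUTPUT_DIR = Path("wordclouds")          # output folder named in this README
OUTPUT_NAME = "wordcloud_all_data.png"   # output file named in this README


def parse_args(argv=None):
    """Parse the single positional CSV path (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Generate a word cloud from a CSV of Persian text."
    )
    parser.add_argument("csv_path", type=Path,
                        help="CSV file with a txtContent column")
    return parser.parse_args(argv)


def output_path(base_dir=Path(".")):
    """Build the image path relative to base_dir (hypothetical helper)."""
    return base_dir / OUTPUT_DIR / OUTPUT_NAME


# Passing an explicit list mimics: python main.py /path/to/file.csv
args = parse_args(["/path/to/file.csv"])
print(args.csv_path, "->", output_path())
```

In a real script, creating the folder first with `output_path().parent.mkdir(parents=True, exist_ok=True)` avoids a failure when wordclouds/ does not exist yet.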
- Uses ParsBERT for embedding.
- Frequent bigrams (2-grams) are extracted and weighted by pointwise mutual information (PMI) and topic distribution.
- The word cloud is rendered in Persian, using a font that supports the script.
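To make the PMI weighting mentioned above concrete, here is a minimal sketch of scoring bigrams by pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))). The function name `bigram_pmi`, the `min_count` threshold, and the toy documents are assumptions for illustration; the project's utils.py may compute the scores differently, and the combination with topic distribution is not shown because the README does not specify it.

```python
import math
from collections import Counter


def bigram_pmi(token_lists, min_count=2):
    """Score each bigram seen at least min_count times by its PMI.

    Hypothetical helper; probabilities are simple relative frequencies.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs

    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / total_bi
        p_x = unigrams[w1] / total_uni
        p_y = unigrams[w2] / total_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy, already-tokenized documents (made-up data).
docs = [
    ["هوش", "مصنوعی", "ایران"],
    ["هوش", "مصنوعی", "جهان"],
    ["اخبار", "ایران"],
]
scores = bigram_pmi(docs)
for pair, score in scores.items():
    print(" ".join(pair), round(score, 3))
```

A high PMI means the two words co-occur more often than their individual frequencies would predict, which is why the measure favors genuine phrases over pairs of merely common words.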
The image below is an example of the output generated by the program:
