A sentiment analysis and topic modeling project for tweets about the Ukraine-Russia conflict. It processes Twitter data to gauge public sentiment, identify key discussion topics, and examine geographical patterns in social media discourse.
- Overview
- Features
- Dataset
- Installation
- Usage
- Methodology
- Output Files
- Results
- Dependencies
- Project Structure
- Contributing
- License
This project performs natural language processing (NLP) and sentiment analysis on tweets discussing the Ukraine-Russia war. It provides insights into:
- Public sentiment across different geographical locations
- Key topics of discussion in high-engagement tweets
- Temporal and spatial patterns in social media discourse
- Classification of tweets by engagement level and location
- Removal of duplicate tweets
- Language filtering (English tweets only)
- Data cleaning and column optimization
- Missing value handling
- Russia Location Tweets: Analysis of tweets from users in Russia
- Ukraine Location Tweets: Analysis of tweets from users in Ukraine
- Other Locations: Analysis of tweets from all other geographical regions
- Filtering tweets with significant engagement (likes, retweets, replies > 1)
- Topic modeling on high-impact content
- Sentiment classification
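The engagement filter listed above could look roughly like the sketch below. It uses the dataset's engagement columns (like count, retweet count, reply count). Whether the project requires any one count or every count to exceed 1 is not stated, so "any" is an assumption here, and the sample rows and the `filter_high_impact` name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the dataset's engagement columns.
df = pd.DataFrame({
    "Text": ["tweet a", "tweet b", "tweet c"],
    "like count": [0, 5, 1],
    "retweet count": [0, 0, 1],
    "reply count": [1, 0, 0],
})

ENGAGEMENT_COLUMNS = ["like count", "retweet count", "reply count"]


def filter_high_impact(frame: pd.DataFrame, threshold: int = 1) -> pd.DataFrame:
    """Keep rows where at least one engagement count is above `threshold`.

    Assumption: "any count > threshold". Swap .any for .all to require every count.
    """
    mask = (frame[ENGAGEMENT_COLUMNS] > threshold).any(axis=1)
    return frame[mask].reset_index(drop=True)


high_impact = filter_high_impact(df)
print(high_impact["Text"].tolist())
```

A boolean mask over the three columns keeps the filter vectorized, which matters when the input holds a large number of tweets.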
- Latent Dirichlet Allocation (LDA): Probabilistic topic modeling with 5, 7, and 10 topics
- Non-Negative Matrix Factorization (NMF): Alternative topic modeling approach
- Interactive Visualization: Using pyLDAvis for topic exploration
- VADER Sentiment Analysis: Using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner)
- Three-class Classification: Positive, Negative, and Neutral sentiments
- Location-based Sentiment: Separate analysis for Russia, Ukraine, and other locations
The project uses the Ukraine_tweetys.txt dataset containing tweets about the Ukraine-Russia conflict.
- Tweet Id: Unique identifier for each tweet
- Text: Tweet content
- Username: Twitter username
- Follower Count: Number of followers
- like count: Number of likes
- retweet count: Number of retweets
- reply count: Number of replies
- location: User's location
- language: Tweet language
- verified: Verification status
Note: The dataset file is not included in this repository. You'll need to provide your own Ukraine_tweetys.txt file.
- Python 3.7 or higher
- pip package manager
- Jupyter Notebook or Google Colab
git clone https://github.com/AR6420/TwitterSentimentAnalysis.git
cd TwitterSentimentAnalysis
pip install pandas numpy nltk gensim spacy scikit-learn pyLDAvis
# Download NLTK data
python -m nltk.downloader vader_lexicon stopwords
# Download spaCy English model
python -m spacy download en_core_web_sm

Place your Ukraine_tweetys.txt file in the project root directory.
The standalone Python script provides a complete, automated analysis pipeline:
python twitter_sentiment_analysis.py

This will:
- Load and clean the data
- Perform location-based filtering
- Extract high-impact tweets
- Run topic modeling (LDA and NMF)
- Apply sentiment analysis
- Save all results to CSV files
For interactive exploration and visualization:
- Open Jupyter Notebook:
  jupyter notebook Twitter_SentimentAnalysis.ipynb
- Or use Google Colab:
  - Click the "Open in Colab" badge at the top of the notebook
  - Upload your Ukraine_tweetys.txt file to Colab
- Run all cells sequentially:
  - The notebook is designed to be run from top to bottom
  - Each section builds on previous results
import pandas as pd
# Load the cleaned data
df = pd.read_csv('cleaned_tweets.csv')
# View high-impact tweets with topics and sentiment
df_highimpact = pd.read_csv('highimpact_nmf.csv')
print(df_highimpact[['Text', 'Topic Label', 'sentiment_nltk']].head())

# Remove duplicates
df = df.drop_duplicates()
# Filter English tweets
df_en = df[df['language']=='en']
# Remove unnecessary columns
df_clean = df_en.drop(columns=['Tweet Id', 'Unnamed: 0', 'Unnamed: 0.1', 'verified'])

The notebook categorizes tweets by matching keywords in the user's location field:
- Russia: Tweets from users with "russia" in location (excluding false positives)
- Ukraine: Tweets from users with "ukraine" in location (with extensive filtering)
- Other: All remaining tweets
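The three categories above can be pictured with a small keyword matcher. This is only a sketch: the project's actual exclusion rules are not shown in this README, so the `FALSE_POSITIVES` entries, the `categorize_location` name, and the choice to check Russia before Ukraine are all assumptions.

```python
# Hypothetical exclusion terms: place names that contain a country keyword
# but refer to somewhere else. The project's real exclusion list is not shown.
FALSE_POSITIVES = {
    "russia": ["prussia", "russiaville"],  # e.g. King of Prussia, PA; Russiaville, IN
    "ukraine": [],
}


def categorize_location(location) -> str:
    """Return 'Russia', 'Ukraine', or 'Other' for a free-text location value."""
    # Missing or empty locations (NaN, None, "") fall through to 'Other'.
    if not isinstance(location, str) or not location.strip():
        return "Other"
    text = location.lower()
    for country in ("russia", "ukraine"):  # assumption: Russia is checked first
        excluded = any(term in text for term in FALSE_POSITIVES[country])
        if country in text and not excluded:
            return country.capitalize()
    return "Other"


for sample in ["Moscow, Russia", "Kyiv, Ukraine", "King of Prussia, PA", None]:
    print(repr(sample), "->", categorize_location(sample))
```

With pandas, a function like this could be applied to the location column via `df["location"].apply(categorize_location)` before splitting the data into the three output files.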
- Lowercasing: Convert all text to lowercase
- Tokenization: Split text into individual words
- Stop Word Removal: Remove common English stop words
- Special Character Removal: Remove URLs, mentions, hashtags
- Bigram/Trigram Creation: Identify common phrase patterns
- Lemmatization: Reduce words to their base form using spaCy
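To make the bigram step in the list above concrete, here is a simplified, frequency-based sketch. It is a stand-in and not the project's code: a phrase detector such as the one in Gensim scores word pairs statistically, whereas this version simply merges adjacent pairs that occur often enough. The `merge_frequent_bigrams` name and `min_count` parameter are hypothetical.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs that occur at least `min_count` times corpus-wide."""
    pair_counts = Counter(
        pair for tokens in docs for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        while i < len(tokens):
            pair = tuple(tokens[i:i + 2])
            if len(pair) == 2 and pair in frequent:
                merged.append("_".join(pair))  # ("human", "rights") -> "human_rights"
                i += 2  # skip both tokens of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical, already-tokenized documents.
docs = [
    ["human", "rights", "violation", "report"],
    ["protect", "human", "rights"],
    ["economic", "sanction", "report"],
]
print(merge_frequent_bigrams(docs))
```

Running the same function a second time on its own output would produce trigram-like tokens, which is the usual way bigram models are stacked.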
- Tested with 5, 7, and 10 topics
- Uses Gensim library
- Interactive visualization with pyLDAvis
- 5 topics identified:
- War Cause: Discussion about war origins and causes
- Military Progress: Updates on military operations
- Refugee and Human Rights: Humanitarian concerns
- Economic Impact: Economic consequences of the war
- News and Journalism: Media coverage and reporting
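An NMF pass of this kind can be sketched with scikit-learn. That the project uses scikit-learn's `NMF` with TF-IDF features is an assumption (scikit-learn is listed as a dependency, but the README does not say how NMF is run). The documents below are invented, and the sketch uses 2 topics only because the toy corpus is tiny; labels such as "War Cause" would still be assigned by hand after reading each topic's top terms.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; the real input would be the high-impact tweets.
docs = [
    "sanctions hit russian economy and oil prices",
    "oil prices rise as sanctions expand",
    "refugees cross the border seeking shelter",
    "aid groups provide shelter for refugees",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topic = model.fit_transform(tfidf)  # shape: (n_docs, n_topics)

# Map column indices back to terms via the fitted vocabulary.
index_to_term = {int(i): term for term, i in vectorizer.vocabulary_.items()}


def top_terms(component, n=4):
    """Return the n highest-weighted terms of one topic vector."""
    top = component.argsort()[::-1][:n]
    return [index_to_term[int(i)] for i in top]


topics = [top_terms(component) for component in model.components_]
dominant_topic = doc_topic.argmax(axis=1)  # strongest topic per document

for topic_id, terms in enumerate(topics):
    print(topic_id, terms)
```

The `dominant_topic` array is what a column like Topic Label would be derived from, after mapping each topic index to a human-chosen name.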
Using VADER (Valence Aware Dictionary and sEntiment Reasoner):
# Sentiment Classification Rules
- Positive: compound score >= 0.33
- Negative: compound score <= -0.33
- Neutral: -0.33 < compound score < 0.33

The notebook generates several CSV files:
| File Name | Description |
|---|---|
| cleaned_tweets.csv | All English tweets after cleaning |
| Russia_loc.csv | Tweets from Russia location |
| Ukraine_loc.csv | Tweets from Ukraine location |
| Other_loc.csv | Tweets from other locations |
| highimpact.csv | High-engagement tweets |
| highimpact_nmf.csv | High-impact tweets with topic labels and sentiment |
The NMF topic modeling reveals five main discussion themes:
- War Cause: Analysis of conflict origins and contributing factors
- Military Progress: Updates on battlefield developments
- Refugee and Human Rights: Focus on humanitarian crisis
- Economic Impact: Discussion of sanctions and economic consequences
- News and Journalism: Media coverage and information warfare
The sentiment analysis provides insights into public opinion across different locations and topics. Results show the distribution of positive, negative, and neutral sentiments in:
- High-impact tweets
- Russia-location tweets
- Ukraine-location tweets
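A distribution of this kind is just a per-group share of each label. The sketch below shows the arithmetic on invented labels. The numbers are placeholders, not project results, and `sentiment_distribution` is a hypothetical helper name.

```python
from collections import Counter

LABELS = ("Positive", "Negative", "Neutral")


def sentiment_distribution(labels):
    """Return the percentage of each sentiment label, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: round(100 * counts[label] / total, 1) for label in LABELS}


# Placeholder labels for illustration only; these are not the project's results.
groups = {
    "High-impact": ["Positive", "Negative", "Negative", "Neutral"],
    "Russia-location": ["Neutral", "Neutral", "Negative"],
}

for name, labels in groups.items():
    print(name, sentiment_distribution(labels))
```

With the generated CSV files loaded into pandas, `value_counts(normalize=True)` on the sentiment column gives the same shares as fractions.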
The notebook includes:
- Interactive pyLDAvis visualizations for topic exploration
- Topic word distributions
- Sentiment breakdowns by category
pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0
gensim>=4.0.0
spacy>=3.0.0
scikit-learn>=0.24.0
pyLDAvis>=3.3.0
- Python 3.7 or higher recommended
- en_core_web_sm (spaCy English model)
- NLTK data: vader_lexicon, stopwords
TwitterSentimentAnalysis/
│
├── twitter_sentiment_analysis.py # Python script (recommended)
├── Twitter_SentimentAnalysis.ipynb # Jupyter notebook (alternative)
├── README.md # This file
├── Ukraine_tweetys.txt # Input dataset (not included)
│
└── Output Files (generated):
├── cleaned_tweets.csv
├── Russia_loc.csv
├── Ukraine_loc.csv
├── Other_loc.csv
├── highimpact.csv
└── highimpact_nmf.csv
The clean_text_nltk() function performs comprehensive text cleaning:
- Removes special characters and punctuation
- Filters out stop words
- Removes Twitter-specific elements (@mentions, URLs, hashtags)
- Filters words shorter than 3 characters
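The cleaning steps listed above can be approximated with a few regular expressions. This is a sketch under assumptions, not the project's `clean_text_nltk()`: the function name `clean_text_sketch` is hypothetical, the inline stop-word set is a tiny placeholder for NLTK's English list, and lowercasing is included so the stop-word lookup matches.

```python
import re

# Tiny illustrative stop-word set; the project downloads NLTK's English list instead.
STOP_WORDS = {"the", "and", "for", "are", "was", "with", "this", "that", "from"}


def clean_text_sketch(text: str) -> str:
    """Strip Twitter elements and punctuation, then drop stop words and short words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # @mentions and #hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation, digits, symbols
    tokens = [t for t in text.split() if len(t) >= 3 and t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text_sketch("The sanctions are hitting #Russia hard!! @user https://t.co/abc123 :("))
```

URLs are removed before the generic punctuation pass so that fragments such as "https" or "abc" do not survive as tokens.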
The apply_sentiment_analysis() function:
- Takes a DataFrame and text column name
- Applies VADER sentiment analysis
- Returns DataFrame with sentiment labels
- Reusable across different datasets
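The labeling half of such a function reduces to thresholding VADER's compound score at ±0.33, as in the classification rules given earlier. The sketch below covers only that step and takes the score as input. With NLTK, the score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader`. The helper name `label_from_compound` and the example scores are hypothetical.

```python
def label_from_compound(compound: float, threshold: float = 0.33) -> str:
    """Map a VADER compound score in [-1, 1] to a three-class label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Made-up compound scores, chosen to exercise each branch and both boundaries.
example_scores = [0.62, 0.33, 0.10, 0.0, -0.33, -0.75]
for score in example_scores:
    print(score, "->", label_from_compound(score))
```

Keeping the threshold as a parameter makes it easy to compare against VADER's commonly used ±0.05 cutoff without touching the rest of the pipeline.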
- Memory Usage: Large datasets may require significant RAM
- Processing Time: Topic modeling and sentiment analysis can be time-intensive
- Optimization: DataFrame operations are written to avoid slow row-by-row loops
- Language: Analysis limited to English tweets only
- Location Accuracy: User-provided location data may be inaccurate or incomplete
- Sentiment Nuance: VADER may miss contextual nuances and sarcasm
- Topic Labels: Manual interpretation of topic modeling results
- Dataset Dependency: Requires external dataset file
Potential improvements for this project:
- Multi-language sentiment analysis
- Deep learning models for sentiment classification
- Real-time tweet streaming and analysis
- Network analysis of tweet propagation
- Temporal trend analysis
- Automated topic labeling
- Comparative analysis across different conflicts
Contributions are welcome! Please feel free to submit a Pull Request. For major changes:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
If you use this project in your research or work, please cite:
@software{twitter_sentiment_analysis,
title={Twitter Sentiment Analysis - Ukraine-Russia War},
author={AR6420},
year={2024},
url={https://github.com/AR6420/TwitterSentimentAnalysis}
}

- NLTK and VADER for sentiment analysis tools
- Gensim for topic modeling capabilities
- spaCy for NLP preprocessing
- The open-source community for various Python libraries
This project is open source and available under the MIT License.
For questions, issues, or collaboration opportunities, please open an issue on GitHub.
Disclaimer: This project is for educational and research purposes only. The sentiment analysis results represent computational analysis of public Twitter data and should not be interpreted as definitive measures of public opinion.