Twitter Sentiment Analysis - Ukraine-Russia War

A comprehensive sentiment analysis and topic modeling project analyzing tweets related to the Ukraine-Russia conflict. This project processes Twitter data to understand public sentiment, identify key discussion topics, and analyze geographical patterns in social media discourse.

Overview

This project performs natural language processing (NLP) and sentiment analysis on tweets discussing the Ukraine-Russia war. It provides insights into:

Public sentiment across different geographical locations
Key topics of discussion in high-engagement tweets
Temporal and spatial patterns in social media discourse
Classification of tweets by engagement level and location

Features

1. Data Preprocessing

Removal of duplicate tweets
Language filtering (English tweets only)
Data cleaning and column optimization
Missing value handling

2. Geographical Analysis

Russia Location Tweets: Analysis of tweets from users in Russia
Ukraine Location Tweets: Analysis of tweets from users in Ukraine
Other Locations: Analysis of tweets from all other geographical regions

3. High-Impact Tweet Analysis

Filtering tweets with significant engagement (likes, retweets, replies > 1)
Topic modeling on high-impact content
Sentiment classification
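
As a rough illustration of the engagement filter, the sketch below keeps tweets whose like, retweet, and reply counts all exceed 1. The README does not say whether all three counts or any one of them must pass the threshold, so the all-of-three rule is an assumption, as are the `filter_high_impact` name and the `min_count` parameter. The column names are taken from the required dataset columns.

```python
import pandas as pd

ENGAGEMENT_COLUMNS = ["like count", "retweet count", "reply count"]

def filter_high_impact(df: pd.DataFrame, min_count: int = 1) -> pd.DataFrame:
    """Keep rows where every engagement count is strictly greater than min_count.

    Assumption: all three counts must exceed the threshold (not just one).
    """
    mask = (df[ENGAGEMENT_COLUMNS] > min_count).all(axis=1)
    return df[mask].reset_index(drop=True)

# Toy data standing in for the cleaned tweets
tweets = pd.DataFrame({
    "Text": ["a", "b", "c"],
    "like count": [10, 0, 5],
    "retweet count": [4, 3, 1],
    "reply count": [2, 2, 7],
})
high_impact = filter_high_impact(tweets)
print(high_impact["Text"].tolist())
```

Switching `.all(axis=1)` to `.any(axis=1)` would give the looser "any count above the threshold" reading.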

4. Topic Modeling

Latent Dirichlet Allocation (LDA): Probabilistic topic modeling with 5, 7, and 10 topics
Non-Negative Matrix Factorization (NMF): Alternative topic modeling approach
Interactive Visualization: Using pyLDAvis for topic exploration

5. Sentiment Analysis

VADER Sentiment Analysis: Using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner)
Three-class Classification: Positive, Negative, and Neutral sentiments
Location-based Sentiment: Separate analysis for Russia, Ukraine, and other locations

Dataset

The project uses the Ukraine_tweetys.txt dataset containing tweets about the Ukraine-Russia conflict.

Required Dataset Columns:

Tweet Id: Unique identifier for each tweet
Text: Tweet content
Username: Twitter username
Follower Count: Number of followers
like count: Number of likes
retweet count: Number of retweets
reply count: Number of replies
location: User's location
language: Tweet language
verified: Verification status

Note: The dataset file is not included in this repository. You'll need to provide your own Ukraine_tweetys.txt file.

Installation

Prerequisites

Python 3.7 or higher
pip package manager
Jupyter Notebook or Google Colab

Step 1: Clone the Repository

git clone https://github.com/AR6420/TwitterSentimentAnalysis.git
cd TwitterSentimentAnalysis

Step 2: Install Required Packages

pip install pandas numpy nltk gensim spacy scikit-learn pyLDAvis

Step 3: Download NLTK and spaCy Resources

# Download NLTK data
python -m nltk.downloader vader_lexicon stopwords

# Download spaCy English model
python -m spacy download en_core_web_sm

Step 4: Prepare Dataset

Place your Ukraine_tweetys.txt file in the project root directory.

Usage

Option 1: Running the Python Script (Recommended)

The standalone Python script provides a complete, automated analysis pipeline:

python twitter_sentiment_analysis.py

This will:

Load and clean the data
Perform location-based filtering
Extract high-impact tweets
Run topic modeling (LDA and NMF)
Apply sentiment analysis
Save all results to CSV files

Option 2: Running the Jupyter Notebook

For interactive exploration and visualization:

Open Jupyter Notebook:

jupyter notebook Twitter_SentimentAnalysis.ipynb

Or use Google Colab:

- Click the "Open in Colab" badge at the top of the notebook
- Upload your Ukraine_tweetys.txt file to Colab

Run all cells sequentially:

- The notebook is designed to be run from top to bottom
- Each section builds on previous results

Quick Start Example

import pandas as pd

# Load the cleaned data
df = pd.read_csv('cleaned_tweets.csv')

# View high-impact tweets with topics and sentiment
df_highimpact = pd.read_csv('highimpact_nmf.csv')
print(df_highimpact[['Text', 'Topic Label', 'sentiment_nltk']].head())

Methodology

1. Data Cleaning

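# df is the raw tweet data, already loaded into a pandas DataFrame
# from Ukraine_tweetys.txt

# Remove duplicate rows
df = df.drop_duplicates()

# Keep English tweets only
df_en = df[df['language'] == 'en']

# Drop columns that are not needed for the analysis
df_clean = df_en.drop(columns=['Tweet Id', 'Unnamed: 0', 'Unnamed: 0.1', 'verified'])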

2. Location-Based Filtering

The notebook categorizes tweets by matching keywords in the user-provided location field:

Russia: Tweets from users with "russia" in location (excluding false positives)
Ukraine: Tweets from users with "ukraine" in location (with extensive filtering)
Other: All remaining tweets
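
The categorization above could be sketched as a keyword match with exclusion lists. The notebook's actual exclusion rules are not reproduced in this README, so the `categorize_location` function and both exclusion tuples below are hypothetical and only illustrate the idea of filtering false positives (for example, "King of Prussia" contains the substring "russia").

```python
import pandas as pd

# Hypothetical exclusion phrases (assumptions, not the notebook's real lists)
RUSSIA_FALSE_POSITIVES = ("prussia",)
UKRAINE_FALSE_POSITIVES = ("stand with ukraine", "support ukraine")

def categorize_location(location) -> str:
    """Return 'Russia', 'Ukraine', or 'Other' for a free-text location."""
    if not isinstance(location, str):
        return "Other"  # missing locations fall into the catch-all group
    text = location.lower()
    if "russia" in text and not any(p in text for p in RUSSIA_FALSE_POSITIVES):
        return "Russia"
    if "ukraine" in text and not any(p in text for p in UKRAINE_FALSE_POSITIVES):
        return "Ukraine"
    return "Other"

df = pd.DataFrame({"location": [
    "Moscow, Russia",
    "King of Prussia, PA",
    "Kyiv, Ukraine",
    "I stand with Ukraine",
    None,
    "London",
]})
df["location_group"] = df["location"].map(categorize_location)
print(df)
```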

3. Text Preprocessing Pipeline

Lowercasing: Convert all text to lowercase
Tokenization: Split text into individual words
Stop Word Removal: Remove common English stop words
Special Character Removal: Remove URLs, mentions, hashtags
Bigram/Trigram Creation: Identify common phrase patterns
Lemmatization: Reduce words to their base form using spaCy

4. Topic Modeling

LDA (Latent Dirichlet Allocation)

Tested with 5, 7, and 10 topics
Uses Gensim library
Interactive visualization with pyLDAvis

NMF (Non-Negative Matrix Factorization)

5 topics identified:
1. War Cause: Discussion about war origins and causes
2. Military Progress: Updates on military operations
3. Refugee and Human Rights: Humanitarian concerns
4. Economic Impact: Economic consequences of the war
5. News and Journalism: Media coverage and reporting
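
The following sketch shows one common way to fit a 5-topic NMF model with scikit-learn on TF-IDF features. The toy texts, the TF-IDF vectorizer, and the NMF settings are assumptions, since this README does not state how the project vectorizes text. Topic indices carry no meaning by themselves; labels such as those listed above have to be assigned by reading each topic's top terms.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts standing in for cleaned high-impact tweets
texts = [
    "nato expansion blamed for the origins of the war",
    "historical causes of the invasion debated again",
    "troops advance near the city as shelling continues",
    "military reports tanks destroyed in the east",
    "refugees cross the border seeking humanitarian aid",
    "human rights groups document attacks on civilians",
    "sanctions push oil and gas prices higher",
    "journalists report from the front line despite censorship",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

nmf = NMF(n_components=5, init="nndsvda", random_state=42, max_iter=500)
doc_topic = nmf.fit_transform(tfidf)          # shape: (n_docs, n_topics)
dominant_topic = doc_topic.argmax(axis=1)     # strongest topic per document

# Terms ordered by column index (works across scikit-learn versions)
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_terms = [[terms[i] for i in comp.argsort()[::-1][:3]]
             for comp in nmf.components_]
for idx, words in enumerate(top_terms):
    print(idx, words)
```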

5. Sentiment Analysis

Using VADER (Valence Aware Dictionary and sEntiment Reasoner):

# Sentiment Classification Rules
- Positive: compound score >= 0.33
- Negative: compound score <= -0.33
- Neutral: -0.33 < compound score < 0.33
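
The rules above map directly onto a small helper. This is a sketch: the `label_from_compound` name and the exact label strings are assumptions. In NLTK, the compound score would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, which requires the `vader_lexicon` resource.

```python
def label_from_compound(compound: float, threshold: float = 0.33) -> str:
    """Map a VADER compound score in [-1, 1] to a three-class label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

for score in (0.8, 0.33, 0.1, -0.33, -0.9):
    print(score, label_from_compound(score))
```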

Output Files

The analysis generates the following CSV files:

| File Name | Description |
|-----------|-------------|
| `cleaned_tweets.csv` | All English tweets after cleaning |
| `Russia_loc.csv` | Tweets from Russia location |
| `Ukraine_loc.csv` | Tweets from Ukraine location |
| `Other_loc.csv` | Tweets from other locations |
| `highimpact.csv` | High-engagement tweets |
| `highimpact_nmf.csv` | High-impact tweets with topic labels and sentiment |

Results

Topic Distribution (High-Impact Tweets)

The NMF topic modeling reveals five main discussion themes:

War Cause: Analysis of conflict origins and contributing factors
Military Progress: Updates on battlefield developments
Refugee and Human Rights: Focus on humanitarian crisis
Economic Impact: Discussion of sanctions and economic consequences
News and Journalism: Media coverage and information warfare

Sentiment Distribution

The sentiment analysis provides insights into public opinion across different locations and topics. Results show the distribution of positive, negative, and neutral sentiments in:

High-impact tweets
Russia-location tweets
Ukraine-location tweets
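
One way to tabulate these distributions with pandas is sketched below. The toy DataFrame stands in for `highimpact_nmf.csv`, and its values are invented purely to show the mechanics; the `Topic Label` and `sentiment_nltk` column names follow the Quick Start example.

```python
import pandas as pd

# Toy stand-in for pd.read_csv('highimpact_nmf.csv'); values are illustrative only
df_highimpact = pd.DataFrame({
    "Topic Label": ["Economic Impact", "Economic Impact",
                    "Military Progress", "Military Progress"],
    "sentiment_nltk": ["Negative", "Neutral", "Negative", "Negative"],
})

# Overall share of each sentiment class
overall = df_highimpact["sentiment_nltk"].value_counts(normalize=True)

# Share of each sentiment class within every topic (rows sum to 1)
by_topic = pd.crosstab(df_highimpact["Topic Label"],
                       df_highimpact["sentiment_nltk"],
                       normalize="index")
print(overall)
print(by_topic)
```

The same two calls apply unchanged to the Russia and Ukraine location files once they carry a sentiment column.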

Visualization

The notebook includes:

Interactive pyLDAvis visualizations for topic exploration
Topic word distributions
Sentiment breakdowns by category

Dependencies

Core Libraries

pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0
gensim>=4.0.0
spacy>=3.0.0
scikit-learn>=0.24.0
pyLDAvis>=3.3.0

Python Version

Python 3.7 or higher recommended

Additional Requirements

en_core_web_sm (spaCy English model)
NLTK data: vader_lexicon, stopwords

Project Structure

TwitterSentimentAnalysis/
│
├── twitter_sentiment_analysis.py      # Python script (recommended)
├── Twitter_SentimentAnalysis.ipynb    # Jupyter notebook (alternative)
├── README.md                           # This file
├── Ukraine_tweetys.txt                 # Input dataset (not included)
│
└── Output Files (generated):
    ├── cleaned_tweets.csv
    ├── Russia_loc.csv
    ├── Ukraine_loc.csv
    ├── Other_loc.csv
    ├── highimpact.csv
    └── highimpact_nmf.csv

Technical Details

Text Preprocessing Function

The clean_text_nltk() function performs comprehensive text cleaning:

Removes special characters and punctuation
Filters out stop words
Removes Twitter-specific elements (@mentions, URLs, hashtags)
Filters words shorter than 3 characters
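
The cleaning steps listed above could look roughly like the following. This is not the project's `clean_text_nltk()` itself: the `clean_text` name, the regular expressions, and the tiny stop-word set are assumptions made so the sketch runs without NLTK data. It also assumes hashtags are removed entirely rather than just stripped of the `#` symbol.

```python
import re

# Small illustrative stop-word set (assumption); the project uses NLTK's English list
STOP_WORDS = {"the", "and", "for", "are", "with", "this", "that", "from", "was", "were"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_text(text: str, stop_words=STOP_WORDS, min_len: int = 3) -> str:
    """Lowercase, strip Twitter-specific elements and punctuation, filter tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)              # remove URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)  # remove @mentions and #hashtags
    text = NON_LETTER_RE.sub(" ", text)       # remove digits, punctuation, symbols
    tokens = [t for t in text.split()
              if len(t) >= min_len and t not in stop_words]
    return " ".join(tokens)

example = clean_text(
    "The sanctions are hitting @user hard! #Ukraine https://t.co/abc123 in 2022"
)
print(example)
```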

Sentiment Analysis Function

The apply_sentiment_analysis() function:

Takes a DataFrame and text column name
Applies VADER sentiment analysis
Returns DataFrame with sentiment labels
Reusable across different datasets
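
A function with this shape might be written as below. This is a sketch under assumptions, not the project's code: the `apply_sentiment` name, the optional `scorer` parameter (added here so the example can run without the VADER lexicon), and the `threshold` parameter are hypothetical. The output column name `sentiment_nltk` follows the Quick Start example.

```python
import pandas as pd

def apply_sentiment(df: pd.DataFrame, text_column: str,
                    scorer=None, threshold: float = 0.33) -> pd.DataFrame:
    """Return a copy of df with a 'sentiment_nltk' label column.

    scorer maps a string to a compound score in [-1, 1]; when omitted,
    NLTK's VADER is used (requires the vader_lexicon resource).
    """
    if scorer is None:
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
        scorer = lambda text: analyzer.polarity_scores(text)["compound"]

    result = df.copy()
    compound = result[text_column].astype(str).map(scorer)
    result["sentiment_nltk"] = "Neutral"
    result.loc[compound >= threshold, "sentiment_nltk"] = "Positive"
    result.loc[compound <= -threshold, "sentiment_nltk"] = "Negative"
    return result

# Demonstration with a stand-in scorer so no NLTK download is needed
toy = pd.DataFrame({"Text": ["great news", "terrible attack", "meeting today"]})
fake_scores = {"great news": 0.8, "terrible attack": -0.7, "meeting today": 0.0}
labeled = apply_sentiment(toy, "Text", scorer=fake_scores.get)
print(labeled)
```

Working on a copy leaves the input DataFrame unchanged, which makes the function safe to reuse on the location-specific subsets.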

Performance Considerations

Memory Usage: Large datasets may require significant RAM
Processing Time: Topic modeling and sentiment analysis can be time-intensive
Optimization: DataFrame operations are written to avoid inefficient row-by-row loops

Known Limitations

Language: Analysis limited to English tweets only
Location Accuracy: User-provided location data may be inaccurate or incomplete
Sentiment Nuance: VADER may miss contextual nuances and sarcasm
Topic Labels: Labels are assigned by manually interpreting the topic modeling output
Dataset Dependency: Requires external dataset file

Future Enhancements

Potential improvements for this project:

Multi-language sentiment analysis
Deep learning models for sentiment classification
Real-time tweet streaming and analysis
Network analysis of tweet propagation
Temporal trend analysis
Automated topic labeling
Comparative analysis across different conflicts

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes:

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Citation

If you use this project in your research or work, please cite:

@software{twitter_sentiment_analysis,
  title={Twitter Sentiment Analysis - Ukraine-Russia War},
  author={AR6420},
  year={2024},
  url={https://github.com/AR6420/TwitterSentimentAnalysis}
}

Acknowledgments

NLTK and VADER for sentiment analysis tools
Gensim for topic modeling capabilities
spaCy for NLP preprocessing
The open-source community for various Python libraries

License

This project is open source and available under the MIT License.

Contact

For questions, issues, or collaboration opportunities, please open an issue on GitHub.

Disclaimer: This project is for educational and research purposes only. The sentiment analysis results represent computational analysis of public Twitter data and should not be interpreted as definitive measures of public opinion.
