AR6420/TwitterSentimentAnalysis

Twitter Sentiment Analysis - Ukraine-Russia War

A sentiment analysis and topic modeling project for tweets about the Ukraine-Russia conflict. It processes Twitter data to measure public sentiment, identify key discussion topics, and compare discourse across geographical locations.

Overview

This project performs natural language processing (NLP) and sentiment analysis on tweets discussing the Ukraine-Russia war. It provides insights into:

  • Public sentiment across different geographical locations
  • Key topics of discussion in high-engagement tweets
  • Spatial patterns in social media discourse
  • Classification of tweets by engagement level and location

Features

1. Data Preprocessing

  • Removal of duplicate tweets
  • Language filtering (English tweets only)
  • Data cleaning and column optimization
  • Missing value handling

2. Geographical Analysis

  • Russia Location Tweets: Analysis of tweets from users in Russia
  • Ukraine Location Tweets: Analysis of tweets from users in Ukraine
  • Other Locations: Analysis of tweets from all other geographical regions

3. High-Impact Tweet Analysis

  • Filtering tweets with significant engagement (likes, retweets, replies > 1)
  • Topic modeling on high-impact content
  • Sentiment classification
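
The engagement filter can be sketched with pandas as follows. This is a minimal illustration, not the project's code: the function name `filter_high_impact` is hypothetical, and it assumes "likes, retweets, replies > 1" means that at least one of the three counts exceeds 1. If the project requires all three to exceed 1, replace `.any(axis=1)` with `.all(axis=1)`.

```python
import pandas as pd

# Column names as listed under "Required Dataset Columns".
ENGAGEMENT_COLUMNS = ["like count", "retweet count", "reply count"]


def filter_high_impact(df, threshold=1):
    """Keep rows where at least one engagement count exceeds `threshold`.

    Assumption: "any" semantics; the README does not say whether one or
    all of the counts must exceed the threshold.
    """
    # Coerce to numbers so stray strings or missing values count as 0.
    counts = df[ENGAGEMENT_COLUMNS].apply(pd.to_numeric, errors="coerce").fillna(0)
    mask = (counts > threshold).any(axis=1)
    return df[mask].copy()


# Small made-up sample, for illustration only.
sample = pd.DataFrame({
    "Text": ["a", "b", "c"],
    "like count": [0, 5, 1],
    "retweet count": [0, 0, 1],
    "reply count": [1, 0, 2],
})
high_impact = filter_high_impact(sample)
```

Building a boolean mask over the three columns keeps the filter vectorized, so it avoids a row-by-row loop on large datasets.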

4. Topic Modeling

  • Latent Dirichlet Allocation (LDA): Probabilistic topic modeling with 5, 7, and 10 topics
  • Non-Negative Matrix Factorization (NMF): Alternative topic modeling approach
  • Interactive Visualization: Using pyLDAvis for topic exploration

5. Sentiment Analysis

  • VADER Sentiment Analysis: Using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner)
  • Three-class Classification: Positive, Negative, and Neutral sentiments
  • Location-based Sentiment: Separate analysis for Russia, Ukraine, and other locations

Dataset

The project uses the Ukraine_tweetys.txt dataset containing tweets about the Ukraine-Russia conflict.

Required Dataset Columns:

  • Tweet Id: Unique identifier for each tweet
  • Text: Tweet content
  • Username: Twitter username
  • Follower Count: Number of followers
  • like count: Number of likes
  • retweet count: Number of retweets
  • reply count: Number of replies
  • location: User's location
  • language: Tweet language
  • verified: Verification status

Note: The dataset file is not included in this repository. You'll need to provide your own Ukraine_tweetys.txt file.

Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager
  • Jupyter Notebook or Google Colab

Step 1: Clone the Repository

git clone https://github.com/AR6420/TwitterSentimentAnalysis.git
cd TwitterSentimentAnalysis

Step 2: Install Required Packages

pip install pandas numpy nltk gensim spacy scikit-learn pyLDAvis

Step 3: Download NLTK and spaCy Resources

# Download NLTK data
python -m nltk.downloader vader_lexicon stopwords

# Download spaCy English model
python -m spacy download en_core_web_sm

Step 4: Prepare Dataset

Place your Ukraine_tweetys.txt file in the project root directory.

Usage

Option 1: Running the Python Script (Recommended)

The standalone Python script provides a complete, automated analysis pipeline:

python twitter_sentiment_analysis.py

This will:

  • Load and clean the data
  • Perform location-based filtering
  • Extract high-impact tweets
  • Run topic modeling (LDA and NMF)
  • Apply sentiment analysis
  • Save all results to CSV files

Option 2: Running the Jupyter Notebook

For interactive exploration and visualization:

  1. Open Jupyter Notebook:

    jupyter notebook Twitter_SentimentAnalysis.ipynb
  2. Or use Google Colab:

    • Click the "Open in Colab" badge at the top of the notebook
    • Upload your Ukraine_tweetys.txt file to Colab
  3. Run all cells sequentially:

    • The notebook is designed to be run from top to bottom
    • Each section builds on previous results

Quick Start Example

import pandas as pd

# Load the cleaned data
df = pd.read_csv('cleaned_tweets.csv')

# View high-impact tweets with topics and sentiment
df_highimpact = pd.read_csv('highimpact_nmf.csv')
print(df_highimpact[['Text', 'Topic Label', 'sentiment_nltk']].head())

Methodology

1. Data Cleaning

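# df is the raw DataFrame loaded from Ukraine_tweetys.txt

# Remove duplicates
df = df.drop_duplicates()

# Filter English tweets
df_en = df[df['language'] == 'en']

# Remove unnecessary columns; errors='ignore' skips any that are absent
# (the 'Unnamed: *' columns typically appear when a saved index is read back)
df_clean = df_en.drop(columns=['Tweet Id', 'Unnamed: 0', 'Unnamed: 0.1', 'verified'], errors='ignore')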

2. Location-Based Filtering

The notebook categorizes each tweet by matching keywords in the user's self-reported location field:

  • Russia: Tweets from users with "russia" in location (excluding false positives)
  • Ukraine: Tweets from users with "ukraine" in location (with extensive filtering)
  • Other: All remaining tweets
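
A simplified sketch of this categorization is shown below. The function name `categorize_location` is hypothetical, and the project's additional false-positive exclusions are not reproduced because the README does not list them. Checking "ukraine" before "russia" when both appear is also an arbitrary assumption.

```python
def categorize_location(location):
    """Return 'Russia', 'Ukraine', or 'Other' for a free-text location."""
    # Missing values (None or NaN) are not strings and fall through to Other.
    if not isinstance(location, str):
        return "Other"
    text = location.lower()
    # Assumption: plain substring match, with "ukraine" taking precedence.
    if "ukraine" in text:
        return "Ukraine"
    if "russia" in text:
        return "Russia"
    return "Other"


labels = [categorize_location(loc) for loc in ["Kyiv, Ukraine", "Moscow, Russia", None]]
```

With a DataFrame, the same function could be applied as `df["location"].map(categorize_location)` before splitting the data into the three location files.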

3. Text Preprocessing Pipeline

  1. Lowercasing: Convert all text to lowercase
  2. Tokenization: Split text into individual words
  3. Stop Word Removal: Remove common English stop words
  4. Special Character Removal: Remove URLs, mentions, hashtags
  5. Bigram/Trigram Creation: Identify common phrase patterns
  6. Lemmatization: Reduce words to their base form using spaCy
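
To illustrate the bigram step (step 5), here is a deliberately simple, count-based sketch using only the standard library. It is not the project's implementation: the function `merge_frequent_bigrams` and the `min_count` parameter are hypothetical, and a library phrase detector would normally use a statistical score rather than a raw count.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    pair_counts = Counter(
        pair for tokens in docs for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in docs:
        merged, i = [], 0
        # Greedy left-to-right scan: merge a frequent pair, then skip past it.
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Made-up tokenized documents, for illustration only.
docs = [
    ["human", "rights", "report"],
    ["human", "rights", "violation"],
    ["rights", "report"],
]
merged = merge_frequent_bigrams(docs)
```

Running the function a second time on its own output would produce trigram-like tokens, which is one common way to extend bigram detection to longer phrases.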

4. Topic Modeling

LDA (Latent Dirichlet Allocation)

  • Tested with 5, 7, and 10 topics
  • Uses Gensim library
  • Interactive visualization with pyLDAvis

NMF (Non-Negative Matrix Factorization)

  • 5 topics identified:
    1. War Cause: Discussion about war origins and causes
    2. Military Progress: Updates on military operations
    3. Refugee and Human Rights: Humanitarian concerns
    4. Economic Impact: Economic consequences of the war
    5. News and Journalism: Media coverage and reporting
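
The following sketch shows one way NMF topics can be extracted with scikit-learn. It assumes TF-IDF features and the `nndsvda` initialization, neither of which is confirmed by this README, and the helper name `top_words_per_topic` is hypothetical. Topic names such as "War Cause" are assigned by a person after reading the top words; the model only returns word weights.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer


def top_words_per_topic(texts, n_topics=5, n_words=5, random_state=0):
    """Fit NMF on TF-IDF features; return top words per topic and a topic id per text."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(texts)

    model = NMF(n_components=n_topics, init="nndsvda",
                random_state=random_state, max_iter=500)
    doc_topic = model.fit_transform(tfidf)  # shape: (n_texts, n_topics)

    # Column index -> term, built from the fitted vocabulary.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

    topics = []
    for weights in model.components_:
        top = weights.argsort()[::-1][:n_words]  # indices of largest weights
        topics.append([terms[i] for i in top])
    # Each text is assigned the topic with the largest weight.
    return topics, doc_topic.argmax(axis=1)


# Made-up cleaned texts, for illustration only.
texts = [
    "sanctions hit economy oil prices rise",
    "oil gas prices economy sanctions",
    "refugees flee border humanitarian aid",
    "humanitarian aid refugees children border",
    "economy inflation sanctions banks",
    "refugees shelter aid volunteers",
]
topics, assignments = top_words_per_topic(texts, n_topics=2, n_words=3)
```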

5. Sentiment Analysis

Using VADER (Valence Aware Dictionary and sEntiment Reasoner):

# Sentiment Classification Rules
- Positive: compound score >= 0.33
- Negative: compound score <= -0.33
- Neutral: -0.33 < compound score < 0.33
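
These rules translate directly into a small function. The sketch below is an illustration rather than the project's code: the names `label_from_compound` and `vader_label` are hypothetical, and the analyzer is passed in so that the thresholding logic can be used without downloading the lexicon.

```python
def label_from_compound(compound, threshold=0.33):
    """Map a VADER compound score in [-1, 1] to a three-class label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def vader_label(text, analyzer):
    """Score `text` with a VADER analyzer and return its label.

    `analyzer` is expected to be an nltk SentimentIntensityAnalyzer,
    which needs the vader_lexicon resource from the installation steps.
    """
    return label_from_compound(analyzer.polarity_scores(text)["compound"])


# Example usage (requires nltk and the vader_lexicon download):
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# analyzer = SentimentIntensityAnalyzer()
# df["sentiment_nltk"] = df["Text"].map(lambda t: vader_label(t, analyzer))
```

Creating the analyzer once and reusing it for every tweet avoids reloading the lexicon on each call.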

Output Files

The notebook generates several CSV files:

File Name            Description
cleaned_tweets.csv   All English tweets after cleaning
Russia_loc.csv       Tweets from Russia location
Ukraine_loc.csv      Tweets from Ukraine location
Other_loc.csv        Tweets from other locations
highimpact.csv       High-engagement tweets
highimpact_nmf.csv   High-impact tweets with topic labels and sentiment

Results

Topic Distribution (High-Impact Tweets)

The NMF topic modeling reveals five main discussion themes:

  1. War Cause: Analysis of conflict origins and contributing factors
  2. Military Progress: Updates on battlefield developments
  3. Refugee and Human Rights: Focus on humanitarian crisis
  4. Economic Impact: Discussion of sanctions and economic consequences
  5. News and Journalism: Media coverage and information warfare

Sentiment Distribution

Sentiment is summarized as the share of positive, negative, and neutral tweets in each of the following subsets:

  • High-impact tweets
  • Russia-location tweets
  • Ukraine-location tweets

Visualization

The notebook includes:

  • Interactive pyLDAvis visualizations for topic exploration
  • Topic word distributions
  • Sentiment breakdowns by category

Dependencies

Core Libraries

pandas>=1.3.0
numpy>=1.21.0
nltk>=3.6.0
gensim>=4.0.0
spacy>=3.0.0
scikit-learn>=0.24.0
pyLDAvis>=3.3.0

Python Version

  • Python 3.7 or higher recommended

Additional Requirements

  • en_core_web_sm (spaCy English model)
  • NLTK data: vader_lexicon, stopwords

Project Structure

TwitterSentimentAnalysis/
│
├── twitter_sentiment_analysis.py      # Python script (recommended)
├── Twitter_SentimentAnalysis.ipynb    # Jupyter notebook (alternative)
├── README.md                           # This file
├── Ukraine_tweetys.txt                 # Input dataset (not included)
│
└── Output Files (generated):
    ├── cleaned_tweets.csv
    ├── Russia_loc.csv
    ├── Ukraine_loc.csv
    ├── Other_loc.csv
    ├── highimpact.csv
    └── highimpact_nmf.csv

Technical Details

Text Preprocessing Function

The clean_text_nltk() function performs comprehensive text cleaning:

  • Removes special characters and punctuation
  • Filters out stop words
  • Removes Twitter-specific elements (@mentions, URLs, hashtags)
  • Filters words shorter than 3 characters
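
A regex-based approximation of these cleaning steps is sketched below. It is not the actual `clean_text_nltk()` source: the name `clean_text` is hypothetical, the stop-word set is a tiny stand-in for the NLTK stopwords corpus, and dropping hashtags entirely (rather than only the `#` symbol) is an assumption.

```python
import re

# Tiny illustrative stop-word set; a stand-in for the NLTK stopwords corpus.
STOP_WORDS = {"the", "and", "for", "are", "with", "this", "that", "from", "was"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_text(text, min_length=3):
    """Lowercase, strip Twitter elements and punctuation, drop short and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)                 # remove URLs first
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)  # remove @mentions and #hashtags
    text = NON_LETTER_RE.sub(" ", text)          # remove digits and punctuation
    tokens = [
        token for token in text.split()
        if len(token) >= min_length and token not in STOP_WORDS
    ]
    return " ".join(tokens)


example = clean_text(
    "The #Ukraine war: @user says refugees are fleeing! https://t.co/abc"
)
```

URLs are removed before punctuation so that their fragments do not survive as stray tokens.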

Sentiment Analysis Function

The apply_sentiment_analysis() function:

  • Takes a DataFrame and text column name
  • Applies VADER sentiment analysis
  • Returns DataFrame with sentiment labels
  • Reusable across different datasets

Performance Considerations

  • Memory Usage: Large datasets may require significant RAM
  • Processing Time: Topic modeling and sentiment analysis can be time-intensive
  • Optimization: DataFrame operations have been optimized to avoid inefficient loops

Known Limitations

  1. Language: Analysis limited to English tweets only
  2. Location Accuracy: User-provided location data may be inaccurate or incomplete
  3. Sentiment Nuance: VADER may miss contextual nuances and sarcasm
  4. Topic Labels: Topic names are assigned by manually interpreting the model output, so they are subjective
  5. Dataset Dependency: Requires external dataset file

Future Enhancements

Potential improvements for this project:

  • Multi-language sentiment analysis
  • Deep learning models for sentiment classification
  • Real-time tweet streaming and analysis
  • Network analysis of tweet propagation
  • Temporal trend analysis
  • Automated topic labeling
  • Comparative analysis across different conflicts

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Citation

If you use this project in your research or work, please cite:

@software{twitter_sentiment_analysis,
  title={Twitter Sentiment Analysis - Ukraine-Russia War},
  author={AR6420},
  year={2024},
  url={https://github.com/AR6420/TwitterSentimentAnalysis}
}

Acknowledgments

  • NLTK and VADER for sentiment analysis tools
  • Gensim for topic modeling capabilities
  • spaCy for NLP preprocessing
  • The open-source community for various Python libraries

License

This project is open source and available under the MIT License.

Contact

For questions, issues, or collaboration opportunities, please open an issue on GitHub.


Disclaimer: This project is for educational and research purposes only. The sentiment analysis results represent computational analysis of public Twitter data and should not be interpreted as definitive measures of public opinion.
