SwissFeels, an interactive sentiment map of Switzerland, built for the EPFL Applied Data Analysis Autumn 2016 course
The goal of our project was to analyze a large dataset of geolocated tweets and construct an interactive sentiment map of Switzerland, similar to that of Happy Maps. We focused on characterizing the sentiment of the tweets as positive or negative towards a certain entity, answering questions such as "is this tweet positive or negative about company X?". The objective was to have an interactive visualization that takes a keyword as input, for example "CFF" (Swiss national railway), and displays the sentiment of each canton on the Swiss map.
The ADA course staff collected tweets from January to November 2016 that were geolocated in Switzerland. Each tweet was annotated with estimates of its language and sentiment. We filtered the original 50GB dataset to a more manageable collection of approximately 3.7 million tweets.
Data wrangling and analysis notebooks:
- First look at the data fields and format
- Fetch the data we want from Hadoop cluster and store it locally
- Exploratory analysis of the data with aggregate statistics and sentiment
- Searching the dataset with a few different subjects
- Test code for the Flask app
- Generate plots for our beautiful poster
Interactive visualization webapp using Flask, pandas and Folium:
- Flask webapp main code
- Backend data searching, map creation and tweet selection functions
- App package requirements
- Static logos and CSS
- Webapp HTML templates
- TopoJSON file for Swiss canton boundaries
- Note: The interactive viz saves every query result in the local directory app/maps
The following fields were necessary in order to process the tweets:
- geo_state: the tweet's source canton
- sentiment: the tweet's sentiment, either Positive, Neutral or Negative
We also decided to keep other interesting fields:
- author_gender: which can be MALE, FEMALE, or UNKNOWN
- lang: the language of the tweet
- main: the raw text of the tweet
- published: the date and time the tweet was published
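As a rough illustration of keeping only the fields listed above, here is a minimal pandas sketch. It assumes each tweet is available as a Python dict; the helper name `select_fields`, the sample records and the `unused_field` key are made up for the example and are not taken from the project notebooks.

```python
import pandas as pd

# Field names come from the lists above.
REQUIRED_FIELDS = ["geo_state", "sentiment"]
EXTRA_FIELDS = ["author_gender", "lang", "main", "published"]


def select_fields(records):
    """Build a DataFrame holding only the fields of interest.

    `records` is assumed to be an iterable of dicts, one per tweet.
    """
    df = pd.DataFrame.from_records(records)
    # Keep the chosen columns in a fixed order; missing ones become NaN.
    df = df.reindex(columns=REQUIRED_FIELDS + EXTRA_FIELDS)
    # A tweet without a canton or a sentiment cannot be placed on the map.
    df = df.dropna(subset=REQUIRED_FIELDS)
    # Parse timestamps; unparseable values become NaT instead of raising.
    df = df.assign(published=pd.to_datetime(df["published"], errors="coerce"))
    return df.reset_index(drop=True)


# Made-up records, for illustration only.
sample = [
    {"geo_state": "Vaud", "sentiment": "Positive", "author_gender": "UNKNOWN",
     "lang": "fr", "main": "Belle journée à Lausanne",
     "published": "2016-09-01 08:15:00", "unused_field": 1},
    {"geo_state": None, "sentiment": "Neutral", "author_gender": "MALE",
     "lang": "en", "main": "No location on this one",
     "published": "2016-03-12 17:40:00", "unused_field": 2},
]
tweets = select_fields(sample)
print(tweets)
```

The second record is dropped because it has no canton, which is the situation described for tweets collected before geolocation was available.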
- There was one major issue with the dataset. The geolocation of the tweets was not collected prior to July 2016. This made ~60% of the data unusable.
- The geo_state field was often valid, but we had to filter out some outliers that were not Swiss cantons. These represented 0.4% of the data.
- Another minor issue was the language detection. Somehow Spanish seems to be spoken a lot more frequently than Italian (a national language)! Looking further into this problem we found that many Italian-language tweets were mislabeled as Spanish.
- Twitter bots were a problem that we couldn't satisfactorily address. For example, many local radio stations automatically tweet their playlists, which polluted the dataset.
- The sentiment analysis algorithm worked poorly on non-English tweets.
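The canton filtering mentioned in the list above could look roughly like the following sketch. The set `KNOWN_CANTONS` is a deliberately incomplete, hypothetical list: a real filter would contain all 26 cantons, spelled exactly as they appear in geo_state, which this README does not specify. The function name and sample data are also assumptions.

```python
import pandas as pd

# Illustrative subset only; the spelling used in the real data is an assumption.
KNOWN_CANTONS = {"Vaud", "Zürich", "Ticino"}


def keep_swiss_cantons(df, cantons=KNOWN_CANTONS):
    """Drop rows whose geo_state is missing or not a known canton.

    Returns the filtered frame and the fraction of rows that were dropped.
    """
    # isin() is False for missing values, so tweets without
    # geolocation are removed together with the outliers.
    valid = df["geo_state"].isin(cantons)
    dropped_fraction = float(1.0 - valid.mean())
    return df[valid].reset_index(drop=True), dropped_fraction


# Made-up rows: one outlier ("Bavaria") and one tweet without geolocation.
sample = pd.DataFrame({
    "geo_state": ["Vaud", "Bavaria", None, "Ticino"],
    "sentiment": ["Positive", "Neutral", "Negative", "Neutral"],
})
kept, dropped = keep_swiss_cantons(sample)
print(kept)
print(f"dropped {dropped:.0%} of rows")
```

Reporting the dropped fraction alongside the filtered data makes it easy to check figures such as the share of outliers quoted above.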
We built an interactive map of Switzerland that displays the mean sentiment of each Swiss canton. The search function restricts the map to the subset of tweets containing search terms such as "SBB CFF FFS". See the screenshots section for an example of mean sentiment. A second option displays the proportion of tweets containing the search terms, also shown in the screenshots section. Some matching tweets are displayed so that users can verify that their query works well.
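The per-canton aggregation behind such a map can be sketched with pandas as follows. Several details are assumptions made for the example rather than the app's actual backend: the numeric encoding in `SENTIMENT_SCORE`, treating a tweet as a match when it contains any one of the space-separated terms, case-insensitive matching, and the function name `canton_stats`.

```python
import pandas as pd

# Assumed numeric encoding of the sentiment labels.
SENTIMENT_SCORE = {"Positive": 1, "Neutral": 0, "Negative": -1}


def canton_stats(df, query):
    """Mean sentiment of matching tweets and match proportion, per canton."""
    terms = [term.lower() for term in query.split()]
    text = df["main"].fillna("").str.lower()

    # Plain substring search: a tweet matches if it contains any term.
    mask = pd.Series(False, index=df.index)
    for term in terms:
        mask |= text.str.contains(term, regex=False)

    scored = df.assign(
        score=df["sentiment"].map(SENTIMENT_SCORE),
        match=mask.astype(int),
    )
    matched = scored[mask]
    return pd.DataFrame({
        # NaN for cantons without any matching tweet.
        "mean_sentiment": matched.groupby("geo_state")["score"].mean(),
        # Share of each canton's tweets that contain a search term.
        "proportion": scored.groupby("geo_state")["match"].mean(),
    })


# Made-up tweets, for illustration only.
sample = pd.DataFrame({
    "geo_state": ["Vaud", "Vaud", "Vaud", "Zürich", "Ticino"],
    "sentiment": ["Positive", "Negative", "Neutral", "Positive", "Neutral"],
    "main": ["J'aime les CFF", "CFF en retard", "Bonjour",
             "SBB ist super", "Ciao"],
})
stats = canton_stats(sample, "SBB CFF FFS")
print(stats)
```

A table like `stats` could then be joined to the canton boundaries and drawn as a choropleth, for instance with Folium's `folium.Choropleth` layer and the TopoJSON file listed above.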
A poster was also presented at the Applied ML Days.
Overall, the SwissFeels project performs quite well. Some queries are polluted by bots or spurious matches, as our current implementation simply searches for string occurrences in the raw text. However, many queries are very clear ("skiing", etc.) and give interesting results. Labeling tweets with entity mentions would provide more reliable search results than the current string matching.
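To make the spurious-match problem concrete, the sketch below contrasts plain substring search with whole-word matching using a regular expression. Whole-word matching is not the entity labeling suggested above and is not part of the project; it is only a simpler alternative shown for comparison, and both function names are hypothetical.

```python
import re


def substring_match(text, term):
    """Case-insensitive substring search, as described above."""
    return term.lower() in text.lower()


def whole_word_match(text, term):
    """Match the term only when it stands alone as a word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Made-up tweets, for illustration only.
relevant = "Great ski day in Verbier"
spurious = "Just asking for a friend"  # "asking" contains "ski"

for tweet in (relevant, spurious):
    print(tweet, substring_match(tweet, "ski"), whole_word_match(tweet, "ski"))
```

Whole-word matching removes hits such as "asking", but it also misses inflected forms like "skiing", so it trades some recall for precision.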





