This project performs real-time analysis of public sentiment toward AI. Comments are classified into three sentiment categories: Hope, Fear, and Neutral.
To run the project, follow these steps:
- Create a `.env` file inside the `config/` folder containing your Reddit credentials. You can use the provided `config/template.env` file as a reference.
- Run the following command to start the project: `docker-compose up --build -d`
- `cassandra.cluster.NoHostAvailable`: If you encounter this error while starting the `spark-streaming` container, restart the container. The error appears when the container starts before Cassandra is up and running.
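Instead of restarting by hand, a client can wait until Cassandra accepts connections before it starts. The following is a minimal sketch using only the Python standard library, not the project's actual startup code. The helper name `wait_for_port` is hypothetical, and the host name `cassandra` in the usage comment is an assumed Compose service name; 9042 is Cassandra's default CQL port.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError (refused, unreachable, DNS failure) until the port is open.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (assumed service name "cassandra", default CQL port 9042):
#     if not wait_for_port("cassandra", 9042):
#         raise SystemExit("Cassandra is still not reachable")
```

An open port does not guarantee that Cassandra is ready to serve queries, so the connecting code may still want to retry on `NoHostAvailable` a few times.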
DistilBERT, a lightweight transformer model, is used for sentiment analysis because of its efficiency and multi-layer encoder architecture. However, the primary goal of this project is to build a scalable and efficient data streaming infrastructure, not to maximize model performance. The dataset used for fine-tuning the model is not of high quality. All data is stored in the `data/` folder.

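To illustrate how the output of a three-class classification head becomes a sentiment label, here is a minimal pure-Python sketch. It does not load the fine-tuned model; it only shows the softmax and argmax step applied to raw scores (logits). The label order in `LABELS` and the helper names are assumptions, since the project's actual label mapping is not stated here.

```python
import math

# Assumed label order; the real id-to-label mapping of the fine-tuned model may differ.
LABELS = ["Neutral", "Hope", "Fear"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


label, confidence = classify([0.1, 2.0, -1.0])
print(label, round(confidence, 3))
```
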
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Neutral | 0.71 | 0.65 | 0.68 | 314 |
| Hope | 0.76 | 0.76 | 0.76 | 304 |
| Fear | 0.80 | 0.85 | 0.82 | 327 |
| Accuracy | - | - | 0.76 | 945 |
| Macro Avg | 0.75 | 0.75 | 0.75 | 945 |
| Weighted Avg | 0.75 | 0.76 | 0.75 | 945 |

Confusion matrix (rows: actual class, columns: predicted class):

| | Neutral | Hope | Fear |
|---|---|---|---|
| Neutral | 204 | 61 | 49 |
| Hope | 50 | 232 | 22 |
| Fear | 35 | 14 | 278 |
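
The per-class scores in the classification report can be recomputed from the confusion matrix. The sketch below reads rows as actual classes and columns as predicted classes, an interpretation that reproduces the reported precision, recall, F1, and support values. The function names are illustrative and not part of the project.

```python
LABELS = ["Neutral", "Hope", "Fear"]

# Rows are actual classes, columns are predicted classes.
CONFUSION = [
    [204, 61, 49],
    [50, 232, 22],
    [35, 14, 278],
]


def per_class_metrics(matrix, labels):
    """Compute precision, recall, F1, and support for each class."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum (support)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return metrics


def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


for label, m in per_class_metrics(CONFUSION, LABELS).items():
    print(f"{label}: P={m['precision']:.2f} R={m['recall']:.2f} F1={m['f1']:.2f} n={m['support']}")
print(f"Accuracy: {accuracy(CONFUSION):.2f}")
```
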
During a real-time streaming run of roughly one hour, from 15:30 to 16:40, a total of 27,732 comments were processed. These comments were classified as follows:

