Diploma thesis for processing Big Data from software processes. It incorporates Apache Kafka, Flink, and Cassandra to handle real time and historical GitHub events for software process analysis.
The running app consists of the real time data analysis and the historical data analysis, and is operated from a Linux terminal.
The thesis text itself (in Greek) can be found at My Thesis online.
## Ingest real time GitHub events
- Step 1: Pull and build the docker images
- Step 2: Compose Kafka, Flink and Cassandra
- Step 3: Compose the real time GitHub events Kafka Producer
- Step 4: Ingest real time GitHub events using a PyFlink job
- Step 5: Expose GitHub events to the UI and deploy it
## Ingest historical GitHub events
- Step 1: Run bash script to create directories for the kafka docker container
- Step 2: Start services kafka, cassandra and flask app ui
- Step 3: Download events of the designated gharchive files, thin them and produce them to kafka
- Step 4: Deploy the screen 2 PyFlink jobs (the jobs computing the screen 2 data)
- Step 5: Deploy the screen 3 PyFlink job (the job computing the screen 3 data)
- Step 6: Deploy the screen 4 PyFlink job (the job computing the screen 4 data)
- Step 7 (optional): Cancel all jobs (you can also do so manually from the UI)
- Step 8 (optional): Delete messages of the 'historical-raw-events' topic if the topic takes up too much space
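The "thinning" in Step 3 refers to stripping each raw GHArchive event down to the few fields the analyses need before producing it to Kafka. A minimal sketch of that idea (the field selection and function name are illustrative, not the actual choice of the thesis's thinner scripts):

```python
import json

# Illustrative field selection; the repository's thinner scripts may keep different keys.
KEPT_FIELDS = ("id", "type", "created_at")

def thin_event(raw_json: str) -> str:
    """Drop everything but a few top-level fields from a raw GitHub event."""
    event = json.loads(raw_json)
    thinned = {k: event[k] for k in KEPT_FIELDS if k in event}
    # Keep only the repository name out of the nested 'repo' object.
    if "repo" in event:
        thinned["repo_name"] = event["repo"].get("name")
    return json.dumps(thinned)
```

Thinning before producing keeps the 'historical-raw-events' topic small, which matters given Step 8's note about the topic's disk usage.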
```shell
# For services:
# kafka
docker image pull bitnami/kafka:3.9
# kafka-ui
docker pull provectuslabs/kafka-ui:v0.7.2
# cassandra
docker image pull cassandra:4.1.7
# cassandra-ui
docker pull ipushc/cassandra-web:v1.1.5
# jobmanager, taskmanager
docker build -f Dockerfile-pyflink -t pyflink:1.18.1 .
# python-real-time-events-producer, event-data-exposing-server,
# events-flask-app, python-historical-events-producer
docker build -f Dockerfile-python -t python:3.10-script-executing-image .
```
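The two Dockerfiles above are part of the repository. For orientation only, a PyFlink image such as pyflink:1.18.1 is typically built along these lines (a sketch modeled on the official Flink Docker guide, not the repository's actual Dockerfile-pyflink):

```dockerfile
# Sketch of a typical PyFlink image; the repository's Dockerfile-pyflink may differ.
FROM flink:1.18.1
RUN apt-get update && apt-get install -y python3 python3-pip && \
    ln -s /usr/bin/python3 /usr/bin/python
RUN pip3 install apache-flink==1.18.1
```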
```shell
# Run bash script to create directories for the kafka docker container
./helpers/setup-kafka-and-ui.sh

# Compose the actual services
docker compose up -d kafka kafka-ui jobmanager taskmanager-real-time cassandra_host cassandra-ui
```

Now you should be able to see
- The kafka-ui at localhost:8080
- The flink web ui at localhost:8081
- The cassandra-ui at localhost:8083
```shell
docker compose up -d python-real-time-events-producer
```

A PyFlink job stores the data of screen 1 in the UI. Deploy the real time job for screen 1:

```shell
docker exec -d -i jobmanager bash -c './bin/flink run -pyclientexec /usr/bin/python -py /opt/flink/usrlib/screen_1_q1_q5_flink_job.py --config_file_path /opt/flink/usrlib/getting-started-in-docker.ini'
```

Expose the GitHub events and deploy the flask app UI:

```shell
docker compose up -d event-data-exposing-server events-flask-app
```

Now you should be able to see
- The flask app UI at localhost:5100
All terminals below are opened in the project's root directory.

First pull and build the images, if not done already (see Step 1: Pull and build the docker images).

Execute the bash script below if it was not already executed for the real time GitHub events ingestion part:

```shell
./helpers/setup-kafka-and-ui.sh
```

```shell
# Start the services
docker compose up kafka kafka-ui jobmanager taskmanager-1 cassandra_host cassandra-ui event-data-exposing-server events-flask-app

# Stop the services
docker compose down kafka kafka-ui jobmanager taskmanager-1 cassandra_host cassandra-ui event-data-exposing-server events-flask-app
```

Now you should be able to see
- The kafka-ui at localhost:8080
- The flink web ui at localhost:8081
- The cassandra-ui at localhost:8083
- The flask app UI at localhost:5100
```shell
# For the historical analysis, choose the events of December 2024 you want to
# download and thin in files historical-files-thinner, historical-files-thinner-2
# (and similarly for 3 and 4), in the lines:
# starting_date_formatted = <earlier-designated-date>
# ending_date_formatted = <later-designated-date>
# of file historical-events-thinner.py (and historical-events-thinner-2.py,
# historical-events-thinner-3.py etc).
# Change the dates as you choose, in the format:
# '2024-12-d-h' (the day -d- should be zero padded but the hour -h- should not)
# Example: '2024-12-01-0' for 12 am on 1/12/2024
#          '2024-12-06-15' for 3 pm on 6/12/2024
# Another example:
# starting_date_formatted = '2024-12-09-1'
# ending_date_formatted = '2024-12-09-3'
docker compose up python-historical-events-thinner  # (for a single downloader and thinner)

# For multiple downloaders and thinners running in parallel:
docker compose up python-historical-events-thinner-2
docker compose up python-historical-events-thinner-3
docker compose up python-historical-events-thinner-4
```
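The start/end strings above name GHArchive hourly files. A small helper can expand such a range into the individual per-hour labels, which is presumably what the thinner scripts iterate over (a sketch; the actual scripts may implement this differently):

```python
from datetime import datetime, timedelta

def hourly_labels(start: str, end: str) -> list[str]:
    """Expand '2024-12-09-1'..'2024-12-09-3' into per-hour labels.

    The day is zero padded but the hour is not, matching the
    gharchive.org hourly file naming described above.
    """
    fmt = "%Y-%m-%d-%H"
    current = datetime.strptime(start, fmt)
    stop = datetime.strptime(end, fmt)
    labels = []
    while current <= stop:
        labels.append(f"{current:%Y-%m-%d}-{current.hour}")
        current += timedelta(hours=1)
    return labels
```

Note that the expansion also handles day and month rollovers, so a range spanning midnight produces the correct next-day labels.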
```shell
# Create the topic
# Note: ignore the error on the deletion of the topic, as the topic has not been created yet
./delete_and_recreate_topic.sh
```
```shell
# Do the same as above for the produced files:
# Change the dates as you choose, in the format:
# '2024-12-d-h' (the day -d- should be zero padded but the hour -h- should not)
# Example: '2024-12-01-0' for 12 am on 1/12/2024
#          '2024-12-06-15' for 3 pm on 6/12/2024
# Another example:
# starting_date_formatted = '2024-12-09-1'
# ending_date_formatted = '2024-12-09-3'
docker compose up python-historical-events-producer
```

In terminals 5-7, change the -pyclientexec option to your host's python path (e.g. /usr/bin/python).
Deploy the screen 2 PyFlink jobs:

```shell
docker exec -i jobmanager bash -c './bin/flink run -pyclientexec /usr/bin/python -py /opt/flink/usrlib/screen_2_q6_q8_flink_job_q6b_q7h.py --config_file_path /opt/flink/usrlib/getting-started-in-docker.ini'
docker exec -i jobmanager bash -c './bin/flink run -pyclientexec /usr/bin/python -py /opt/flink/usrlib/screen_2_q6_q8_flink_job_q8b_q8h.py --config_file_path /opt/flink/usrlib/getting-started-in-docker.ini'
```

Deploy the screen 3 PyFlink job:

```shell
docker exec -i jobmanager bash -c './bin/flink run -pyclientexec /usr/bin/python -py /opt/flink/usrlib/screen_3_q9_q10_flink_job.py --config_file_path /opt/flink/usrlib/getting-started-in-docker.ini'
```

Deploy the screen 4 PyFlink job:

```shell
docker exec -i jobmanager bash -c './bin/flink run -pyclientexec /usr/bin/python -py /opt/flink/usrlib/screen_4_q11_q15_flink_job.py --config_file_path /opt/flink/usrlib/getting-started-in-docker.ini'
```

Cancel all jobs (optional; you can also do so manually from the Flink web UI):

```shell
docker compose up cancel-all-flink-jobs
```

Step 8 (optional): Delete messages of the 'historical-raw-events' topic if the topic takes up too much space
```shell
# Free up the space of the topic (delete its messages and make its size = 0)
cd usrlib
./delete_and_recreate_topic.sh
```
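As an aside, Step 7's cancel-all-flink-jobs service can also be reproduced against Flink's REST API: GET /jobs lists the jobs, and PATCH /jobs/&lt;id&gt;?mode=cancel cancels one. The response-parsing half is sketched below; the endpoints follow the official Flink REST API, while pick_running_ids is a hypothetical helper, not code from the repository:

```python
def pick_running_ids(jobs_overview: dict) -> list[str]:
    """Extract the ids of RUNNING jobs from Flink's GET /jobs response.

    The response has the shape {"jobs": [{"id": "...", "status": "RUNNING"}, ...]}.
    """
    return [job["id"]
            for job in jobs_overview.get("jobs", [])
            if job.get("status") == "RUNNING"]

# Against the live cluster of this setup one would then issue, per id:
#   PATCH http://localhost:8081/jobs/<id>?mode=cancel
```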