This project shows one way to perform batch processing using mainly AWS services and a few open-source tools.
- Overview
- The Goal
- The dataset
- Data modeling
- Tools
- Scalability
- Running the project
- Project limitations
The current work aims to answer business questions concerning bicycle rentals in the city of London from January 2021 to January 2022. To do so, we are going to build a data pipeline which collects data from multiple sources, applies transformations and displays the preprocessed data on a dashboard.
The following diagram illustrates the high-level structure of the pipeline, where data flows from different sources to the final visualisation tool.
The end goal of this project is to preprocess the data on the AWS platform and get useful insights from it. We can learn more from the data by answering the following business questions on the final dashboard.
- Which hour of the day has the most active rentals on average?
- Which area of London has the most active bike rentals?
- Which day of the week is the most active in general?
- What is the overall trend for daily rentals over the year?
We are going to process 3 datasets in this project.
- Cycling journey dataset from January 2021 to January 2022. It is spread across multiple files on the Transport for London (TFL) website. We will scrape the web page to extract all the relevant links, then download each file (a scraping sketch follows this section). This dataset contains the main features of every cycling journey, including the start and end locations of each journey, the departure and arrival timestamps, etc.
- Stations dataset, which encompasses the details of every station involved in a journey. This dataset is quite outdated as it does not include stations added after 2016. To solve this issue, we will add to this old dataset all the new stations we encounter in the journeys. The stations dataset was found on the What Do They Know forum and can be downloaded directly from here.
- Weather dataset, which includes daily weather data for the city of London from January 2021 to January 2022. It was originally retrieved from the Visual Crossing website and made available to download from this link.
In total, the cycling journey data contains 10,925,928 entries, the stations data 808 and the weather data 396.
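To make the extraction step more concrete, here is a minimal scraping sketch using Selenium and BeautifulSoup. The page URL, the wait condition and the `JourneyDataExtract` file filter are assumptions for illustration, not the project's exact logic.

```python
# Sketch: collect journey file links with Selenium + BeautifulSoup.
# The URL, wait condition and "JourneyDataExtract" filter are illustrative assumptions.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")      # no browser window needed
driver = webdriver.Chrome(options=options)

driver.get("https://cycling.data.tfl.gov.uk/")  # the file listing is rendered lazily
# Wait until at least one link is present on the page before parsing it.
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "a")))

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Keep only the CSV journey extracts, which can then be downloaded one by one.
links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if "JourneyDataExtract" in a["href"] and a["href"].endswith(".csv")
]
print(f"Found {len(links)} journey files")
```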
We are going to build a Star Schema, which comprises one fact table and multiple dimension tables, for our Data Warehouse.
The Entity Relationship Diagram (ERD) for the final Data Warehouse is shown in the following image:

In the transformation phase, several columns will be removed from both the weather and journey data. We will also add a dimension table, dim_datetime, which will serve as the reference for all datetime-related columns.
This schema will facilitate exploring the data in order to answer the relevant business questions about it.
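As an illustration of this transformation step, here is a hedged PySpark sketch of how a `dim_datetime` table could be derived from the journey timestamps. The S3 paths and the column names such as `start_date` are assumptions, not the project's actual Spark job.

```python
# Sketch: derive a dim_datetime dimension from the journey timestamps with PySpark.
# The S3 paths and the start_date / end_date column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-dim-datetime").getOrCreate()

journeys = spark.read.parquet("s3://<your-bucket>/raw/journeys/")

# Collect every distinct timestamp referenced by the fact table,
# then expand it into the attributes the dashboard questions need.
dim_datetime = (
    journeys.select(F.col("start_date").alias("datetime"))
    .union(journeys.select(F.col("end_date").alias("datetime")))
    .distinct()
    .withColumn("hour", F.hour("datetime"))
    .withColumn("day_of_week", F.dayofweek("datetime"))
    .withColumn("month", F.month("datetime"))
    .withColumn("year", F.year("datetime"))
)

dim_datetime.write.mode("overwrite").parquet("s3://<your-bucket>/processed/dim_datetime/")
```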
- Terraform: an open-source tool which provides Infrastructure as Code (IaC). It allows us to build and maintain our AWS infrastructure, including Redshift, S3 and an EC2 instance. We will not include our EMR clusters in Terraform, as they will be created and terminated from Airflow when we need them (a sketch of driving the cluster lifecycle from Airflow follows this list).
- Apache Airflow: an open-source tool to programmatically author, schedule and monitor workflows. The majority of the data tasks in this project will be orchestrated and monitored in Airflow.
- Selenium and BeautifulSoup: packages which help us perform web scraping. BeautifulSoup cannot scrape a webpage that loads its data lazily; this is where Selenium comes into the picture, as it can wait for specific content to load on the page before doing further processing.
- AWS S3 (Simple Storage Service): provides large-scale storage in which we create our Data Lake. We will store all the raw data in this location. The preprocessed data will also be stored in S3 before being loaded into Redshift.
- Apache Spark: an open-source engine that can efficiently process Big Data in a distributed, parallel system. We will use PySpark (Spark with Python) to transform the raw data and prepare it for the Data Warehouse on Redshift.
- AWS EMR (Elastic MapReduce): a managed cluster platform for running big data tools such as Spark and Hadoop. We will use EMR to run our Spark jobs during the transformation phase.
- AWS Redshift: a fully managed and highly scalable data warehouse solution offered by Amazon. We will build our Data Warehouse on Redshift and make the data available to visualisation tools from there.
- Metabase: another open-source tool that allows easy visualisation and analytics of structured data. We will build a dashboard with Metabase to better visualise the data stored in Redshift.
- Docker: a platform-as-a-service product which containerises software, allowing it to behave the same way across multiple environments. In this project, we will run Airflow and Metabase on Docker.
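Since the EMR clusters are managed from Airflow rather than Terraform, the DAG is the natural place for the cluster lifecycle. Below is a minimal sketch using the Amazon provider's EMR operators; the cluster configuration, DAG id and task ids are assumptions rather than the project's exact DAG.

```python
# Sketch: create and terminate an EMR cluster from an Airflow DAG.
# The cluster configuration, DAG id and task ids are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

JOB_FLOW_OVERRIDES = {
    "Name": "cycling-transform",
    "ReleaseLabel": "emr-6.5.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG("emr_lifecycle_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    # Spark steps would run in between; the terminate task pulls the cluster id via XCom.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )
    create_cluster >> terminate_cluster
```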
It is always good practice to consider scalability scenarios when building a data pipeline, since a significant increase in data volume is to be expected in the future.
For instance, if the volume of data grows 500x or even 1000x, that should not break our pipeline.
We would need to scale our EMR cluster nodes either horizontally or both vertically and horizontally (a short sketch follows the list below):
- Horizontal scaling refers to adding more nodes to the cluster to process the higher data volume.
- Vertical and horizontal scaling means that we increase the capacity of the existing nodes and also add new nodes to the cluster.
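To make the distinction concrete, here is a hedged boto3 sketch: horizontal scaling amounts to raising the instance count of an existing instance group, while vertical scaling is decided when the cluster is defined by choosing a larger instance type. The region, cluster id and instance-group id are placeholders.

```python
# Sketch: scaling an EMR cluster out (horizontally) with boto3.
# The region, cluster id and instance-group id are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-2")

# Horizontal scaling: add more core nodes to a running cluster.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {"InstanceGroupId": "ig-XXXXXXXXXXXXX", "InstanceCount": 8},
    ],
)

# Vertical (+ horizontal) scaling is set when the cluster is defined:
# choose a larger InstanceType (e.g. m5.2xlarge instead of m5.xlarge)
# in the job-flow configuration and raise InstanceCount at the same time.
```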
In order to run the project smoothly, a few requirements should be met:
- An AWS account with sufficient permissions to access and work on S3, Redshift and EMR. To set this up:
  - Go to IAM in the AWS console.
  - Create a new user.
  - Attach the following permissions to that new user: `AmazonS3FullAccess`, `AmazonRedshiftFullAccess`, `AdministratorAccess`, `AmazonEMRFullAccessPolicy_v2`, `AmazonEMRServicePolicy_v2`, `AmazonEC2FullAccess`.
  - In the "Security credentials" tab, create an access key and download the `.csv` file.
- It is also necessary to have the AWS account preconfigured locally (i.e. having `~/.aws/credentials` and `~/.aws/config` available in your local environment). This AWS Doc shows the essential steps to set up a local environment for AWS (a quick sanity check is sketched after this list).
- Docker and Docker Compose, preinstalled in your local environment. Otherwise, they can be installed from Get Docker.
- Terraform, preinstalled in your local environment. If not, please install it by following the instructions on the official download page.
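The following optional Python snippet can confirm that the local AWS configuration is picked up before running Terraform or Airflow. It is a convenience check, not a step required by the project.

```python
# Optional sanity check: confirm that ~/.aws/credentials and ~/.aws/config are picked up.
import boto3

session = boto3.Session()                      # reads the default profile
identity = session.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])
print("Default region:", session.region_name)  # comes from ~/.aws/config
```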
Clone the repository:

git clone https://github.com/HoracioSoldman/batch-processing-on-aws.git

We are going to use Terraform to build our AWS infrastructure. From the project root folder, move to the ./terraform directory:

cd terraform

Then run the terraform commands one by one:
- Initialization: `terraform init`
- Planning: `terraform plan`
- Applying: `terraform apply`
- Go to the AWS Redshift cluster which was freshly created by Terraform.
- Connect to your database, then go to `Query Data`.
- Manually copy the content of CyclingERD.sql into the query field and `RUN` the command. This will create the tables and attach their constraints (a programmatic alternative is sketched below).
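If you prefer to run the DDL script programmatically rather than pasting it into the query editor, here is a hedged alternative using the `redshift_connector` package. The connection details are placeholders; the manual route above is what the project describes.

```python
# Sketch: run CyclingERD.sql against the Redshift cluster without the console.
# Host, database and credentials are placeholders.
import redshift_connector

with open("CyclingERD.sql") as f:
    ddl = f.read()

conn = redshift_connector.connect(
    host="<your-cluster>.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="<your-password>",
)
with conn.cursor() as cursor:
    # Naive split on ";" is enough for a plain DDL file.
    for statement in ddl.split(";"):
        if statement.strip():
            cursor.execute(statement)
conn.commit()
conn.close()
```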
- From the project root folder, move to the `./airflow` directory: `cd airflow`
- Create the environment variables file for our future Docker containers: `cp .env.example .env`
- Fill in the content of the `.env` file. The value for `AIRFLOW_UID` is obtained from the following command: `echo -e "AIRFLOW_UID=$(id -u)"`. The value for `AIRFLOW_GID` can be left at `0`.
- Build our extended Airflow Docker image: `docker build -t airflow-img .` If you would prefer another tag, replace `airflow-img` with whatever you like, then make sure that you also change the image tag in docker-compose.yaml at line 48: `image: <your-tag>:latest`. This process might take up to 15 minutes or more depending on your internet speed. At this stage, Docker also installs several packages defined in requirements.txt.
- Run docker-compose to launch Airflow.

  Initialise Airflow: `docker-compose up airflow-init`

  Launch Airflow: `docker-compose up`

  This last command launches the `Airflow Postgres` internal database, the `Airflow Scheduler` and the `Airflow Webserver`, which would have to be launched separately if we were not using Docker.
Once Airflow is up and running, we can now proceed to the most exciting part of the project.
The initialisation DAGs (init_?_*_dag) are interdependent. In essence, each DAG waits for the successful run of its predecessor before starting its tasks.
For instance, init_1_spark_emr_dag will not start until init_0_ingestion_to_s3_dag has completed successfully.
In order to trigger these DAGs, please enable all 4 of them SIMULTANEOUSLY.
The processor DAGs (proc_?_*_dag), on the other hand, need to be started individually.
It is necessary to wait for the 4 initialisation DAGs to complete before starting the processor ones.
To run these last 3 DAGs, please enable proc_0_ingestion_to_s3_dag and wait for it to finish its tasks before enabling the next DAG: proc_1_spark_emr_dag.
Likewise, it is necessary to wait until the end of the proc_1_spark_emr_dag run before enabling the last DAG: proc_2_s3_to_redshift_dag.
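One common way to express this kind of cross-DAG dependency in Airflow is an `ExternalTaskSensor`. The following minimal sketch reuses the DAG names above purely for illustration; it is not the project's actual implementation.

```python
# Sketch: make a DAG wait for the successful run of its predecessor.
# Illustrative only; the DAG ids follow the naming above but this is not the project's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    "init_1_spark_emr_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
) as dag:
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion_to_s3",
        external_dag_id="init_0_ingestion_to_s3_dag",
        external_task_id=None,      # wait for the whole upstream DAG, not a single task
        poke_interval=60,
    )
    run_spark_steps = EmptyOperator(task_id="run_spark_steps")  # stands in for the EMR tasks

    wait_for_ingestion >> run_spark_steps
```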
The following screenshot shows a successful run of the first DAG.
After all the DAG operations, we can now move to Metabase to visualise the data.
Again we will install and run Metabase in a Docker container.
docker run -d -p 3033:3000 --name metabase metabase/metabase

The first time it runs, the above command downloads the latest Metabase Docker image before exposing the application on port 3033.
Once the above command finishes its execution, Metabase should be available at http://localhost:3033.
We can now connect our Redshift database to this platform and visualise the data in multiple charts.
The following screenshot displays part of our final dashboard, which clearly shows some useful insights about bicycle rides across different dimensions.


