This project shows one way to perform batch processing using mainly AWS services and a few open-source tools.
- Overview
- The Goal
- The dataset
- Data modeling
- Tools
- Scalability
- Running the project
- Project limitations
The current work aims to answer business questions concerning bicycle rentals in the city of London from January 2021 to January 2022. To do so, we are going to build a data pipeline which collects data from multiple sources, applies transformations and displays the preprocessed data on a dashboard.
The following diagram illustrates the high-level structure of the pipeline, where data flows from different sources to the final visualisation tool.
The end goal of this project is to preprocess the data on the AWS platform and get useful insights from it. We can learn more from the data by answering the following business questions on the final dashboard.
- Which hour of the day has the most active rentals on average?
- Which area of London has the most active bike rentals?
- Which day of the week is the most active in general?
- What is the overall trend for daily rentals over the year?
We are going to process 3 datasets in this project.
- Cycling journey dataset from January 2021 to January 2022. It is spread across multiple files on the Transport for London (TFL) website. We will scrape the web page to extract all the relevant links, then download each file (a scraping sketch follows this section). This dataset contains the main features of every cycling journey, including the start and end locations of each journey, the departure and arrival timestamps, etc.
- Stations dataset, which encompasses the details of every station involved in a journey. This dataset is quite outdated as it does not include stations added after 2016. To solve this issue, we will add to this old dataset all the new stations we encounter in the journeys. The stations dataset was found on the What Do They Know forum and can be downloaded directly from here.
- Weather dataset, which includes daily weather data for the city of London from January 2021 to January 2022. It was originally retrieved from the Visual Crossing website and made available to download from this link.
In total, the cycling journey data contains 10,925,928 entries, the stations data 808 and the weather data 396.
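To make the extraction step more concrete, here is a minimal scraping sketch using Selenium and BeautifulSoup. The page URL, the wait condition and the `JourneyDataExtract` file filter are assumptions for illustration, not the project's exact logic.

```python
# Sketch: collect journey file links with Selenium + BeautifulSoup.
# The URL, wait condition and "JourneyDataExtract" filter are illustrative assumptions.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")      # no browser window needed
driver = webdriver.Chrome(options=options)

driver.get("https://cycling.data.tfl.gov.uk/")  # the file listing is rendered lazily
# Wait until at least one link is present on the page before parsing it.
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.TAG_NAME, "a")))

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Keep only the CSV journey extracts, which can then be downloaded one by one.
links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if "JourneyDataExtract" in a["href"] and a["href"].endswith(".csv")
]
print(f"Found {len(links)} journey files")
```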
We are going to build a Star Schema, which comprises one fact table and multiple dimension tables, for our Data Warehouse.
The Entity Relationship Diagram (ERD) for the final Data Warehouse is shown in the following image:

In the transformation phase, several columns will be removed from both the weather and journey data. We will also add a dimension table, dim_datetime, which will serve as the reference for all datetime-related columns.
This schema will facilitate exploring the data in order to answer the relevant business questions about it.
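As an illustration of this transformation step, here is a hedged PySpark sketch of how a `dim_datetime` table could be derived from the journey timestamps. The S3 paths and the column names such as `start_date` are assumptions, not the project's actual Spark job.

```python
# Sketch: derive a dim_datetime dimension from the journey timestamps with PySpark.
# The S3 paths and the start_date / end_date column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-dim-datetime").getOrCreate()

journeys = spark.read.parquet("s3://<your-bucket>/raw/journeys/")

# Collect every distinct timestamp referenced by the fact table,
# then expand it into the attributes the dashboard questions need.
dim_datetime = (
    journeys.select(F.col("start_date").alias("datetime"))
    .union(journeys.select(F.col("end_date").alias("datetime")))
    .distinct()
    .withColumn("hour", F.hour("datetime"))
    .withColumn("day_of_week", F.dayofweek("datetime"))
    .withColumn("month", F.month("datetime"))
    .withColumn("year", F.year("datetime"))
)

dim_datetime.write.mode("overwrite").parquet("s3://<your-bucket>/processed/dim_datetime/")
```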
- Terraform: an open-source tool which provides Infrastructure as Code (IaC). It allows us to build and maintain our AWS infrastructure, including Redshift, S3 and an EC2 instance. We will not include our EMR clusters in Terraform, as they will be created and terminated from Airflow when we need them (a sketch of driving the cluster lifecycle from Airflow follows this list).
- Apache Airflow: an open-source tool to programmatically author, schedule and monitor workflows. The majority of the data tasks in this project will be orchestrated and monitored in Airflow.
- Selenium and BeautifulSoup: packages which help us perform web scraping. BeautifulSoup cannot scrape a webpage that loads its data lazily; this is where Selenium comes into the picture, as it can wait for specific content to load on the page before doing further processing.
- AWS S3 (Simple Storage Service): provides large-scale storage in which we create our Data Lake. We will store all the raw data in this location. The preprocessed data will also be stored in S3 before being loaded into Redshift.
- Apache Spark: an open-source engine that can efficiently process Big Data in a distributed, parallel system. We will use PySpark (Spark with Python) to transform the raw data and prepare it for the Data Warehouse on Redshift.
- AWS EMR (Elastic MapReduce): a managed cluster platform for running big data tools such as Spark and Hadoop. We will use EMR to run our Spark jobs during the transformation phase.
- AWS Redshift: a fully managed and highly scalable data warehouse solution offered by Amazon. We will build our Data Warehouse on Redshift and make the data available to visualisation tools from there.
- Metabase: another open-source tool that allows easy visualisation and analytics of structured data. We will build a dashboard with Metabase to better visualise the data stored in Redshift.
- Docker: a platform-as-a-service product which containerises software, allowing it to behave the same way across multiple environments. In this project, we will run Airflow and Metabase on Docker.
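Since the EMR clusters are managed from Airflow rather than Terraform, the DAG is the natural place for the cluster lifecycle. Below is a minimal sketch using the Amazon provider's EMR operators; the cluster configuration, DAG id and task ids are assumptions rather than the project's exact DAG.

```python
# Sketch: create and terminate an EMR cluster from an Airflow DAG.
# The cluster configuration, DAG id and task ids are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)

JOB_FLOW_OVERRIDES = {
    "Name": "cycling-transform",
    "ReleaseLabel": "emr-6.5.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

with DAG("emr_lifecycle_sketch", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
    )
    # Spark steps would run in between; the terminate task pulls the cluster id via XCom.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
    )
    create_cluster >> terminate_cluster
```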
It is always good practice to consider scalability scenarios when building a data pipeline, since a significant increase in data volume is to be expected in the future.
For instance, if the volume of data grows 500x or even 1000x, that should not break our pipeline.
We would need to scale our EMR cluster nodes either horizontally or both vertically and horizontally (a short sketch follows the list below):
- Horizontal scaling refers to adding more nodes to the cluster to process the higher data volume.
- Vertical and horizontal scaling means that we increase the capacity of the existing nodes and also add new nodes to the cluster.
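To make the distinction concrete, here is a hedged boto3 sketch: horizontal scaling amounts to raising the instance count of an existing instance group, while vertical scaling is decided when the cluster is defined by choosing a larger instance type. The region, cluster id and instance-group id are placeholders.

```python
# Sketch: scaling an EMR cluster out (horizontally) with boto3.
# The region, cluster id and instance-group id are placeholders.
import boto3

emr = boto3.client("emr", region_name="eu-west-2")

# Horizontal scaling: add more core nodes to a running cluster.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {"InstanceGroupId": "ig-XXXXXXXXXXXXX", "InstanceCount": 8},
    ],
)

# Vertical (+ horizontal) scaling is set when the cluster is defined:
# choose a larger InstanceType (e.g. m5.2xlarge instead of m5.xlarge)
# in the job-flow configuration and raise InstanceCount at the same time.
```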
In order to run the project smoothly, a few requirements should be met:
- An AWS account with sufficient permissions to access and work on S3, Redshift and EMR. To set this up:
  - Go to IAM in the AWS console.
  - Create a new user.
  - Attach the following permissions to that new user: `AmazonS3FullAccess`, `AmazonRedshiftFullAccess`, `AdministratorAccess`, `AmazonEMRFullAccessPolicy_v2`, `AmazonEMRServicePolicy_v2`, `AmazonEC2FullAccess`.
  - In the "Security credentials" tab, create an access key and download the `.csv` file.
- It is also necessary to have the AWS account preconfigured locally (i.e. having `~/.aws/credentials` and `~/.aws/config` available in your local environment). This AWS Doc shows the essential steps to set up a local environment for AWS (a quick sanity check is sketched after this list).
- Docker and Docker Compose, preinstalled in your local environment. Otherwise, they can be installed from Get Docker.
- Terraform, preinstalled in your local environment. If not, please install it by following the instructions on the official download page.
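The following optional Python snippet can confirm that the local AWS configuration is picked up before running Terraform or Airflow. It is a convenience check, not a step required by the project.

```python
# Optional sanity check: confirm that ~/.aws/credentials and ~/.aws/config are picked up.
import boto3

session = boto3.Session()                      # reads the default profile
identity = session.client("sts").get_caller_identity()
print("Authenticated as:", identity["Arn"])
print("Default region:", session.region_name)  # comes from ~/.aws/config
```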
Clone the repository:

git clone https://github.com/HoracioSoldman/batch-processing-on-aws.git

We are going to use Terraform to build our AWS infrastructure. From the project root folder, move to the ./terraform directory:

cd terraform

Then run the terraform commands one by one:
- Initialization: `terraform init`
- Planning: `terraform plan`
- Applying: `terraform apply`
- Go to the AWS Redshift cluster which was freshly created by Terraform.
- Connect to your database, then go to `Query Data`.
- Manually copy the content of CyclingERD.sql into the query field and `RUN` the command. This will create the tables and attach their constraints (a programmatic alternative is sketched below).
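If you prefer to run the DDL script programmatically rather than pasting it into the query editor, here is a hedged alternative using the `redshift_connector` package. The connection details are placeholders; the manual route above is what the project describes.

```python
# Sketch: run CyclingERD.sql against the Redshift cluster without the console.
# Host, database and credentials are placeholders.
import redshift_connector

with open("CyclingERD.sql") as f:
    ddl = f.read()

conn = redshift_connector.connect(
    host="<your-cluster>.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="<your-password>",
)
with conn.cursor() as cursor:
    # Naive split on ";" is enough for a plain DDL file.
    for statement in ddl.split(";"):
        if statement.strip():
            cursor.execute(statement)
conn.commit()
conn.close()
```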
- From the project root folder, move to the `./airflow` directory: `cd airflow`
- Create the environment variables file for our future Docker containers: `cp .env.example .env`
- Fill in the content of the `.env` file. The value for `AIRFLOW_UID` is obtained from the following command: `echo -e "AIRFLOW_UID=$(id -u)"`. The value for `AIRFLOW_GID` can be left at `0`.
- Build our extended Airflow Docker image: `docker build -t airflow-img .` If you would prefer another tag, replace `airflow-img` with whatever you like, then make sure that you also change the image tag in docker-compose.yaml at line 48: `image: <your-tag>:latest`. This process might take up to 15 minutes or more depending on your internet speed. At this stage, Docker also installs several packages defined in requirements.txt.
- Run docker-compose to launch Airflow.

  Initialise Airflow: `docker-compose up airflow-init`

  Launch Airflow: `docker-compose up`

  This last command launches the `Airflow Postgres` internal database, the `Airflow Scheduler` and the `Airflow Webserver`, which would have to be launched separately if we were not using Docker.
Once Airflow is up and running, we can now proceed to the most exciting part of the project.
The initialisation DAGs (init_?_*_dag) are interdependent. In essence, each DAG waits for the successful run of its predecessor before starting its tasks.
For instance, init_1_spark_emr_dag will not start until init_0_ingestion_to_s3_dag has completed successfully.
In order to trigger these DAGs, please enable all 4 of them SIMULTANEOUSLY.
The processor DAGs (proc_?_*_dag), on the other hand, need to be started individually.
It is necessary to wait for the 4 initialisation DAGs to complete before starting the processor ones.
To run these last 3 DAGs, please enable proc_0_ingestion_to_s3_dag and wait for it to finish its tasks before enabling the next DAG: proc_1_spark_emr_dag.
Likewise, it is necessary to wait until the end of the proc_1_spark_emr_dag run before enabling the last DAG: proc_2_s3_to_redshift_dag.
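One common way to express this kind of cross-DAG dependency in Airflow is an `ExternalTaskSensor`. The following minimal sketch reuses the DAG names above purely for illustration; it is not the project's actual implementation.

```python
# Sketch: make a DAG wait for the successful run of its predecessor.
# Illustrative only; the DAG ids follow the naming above but this is not the project's code.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    "init_1_spark_emr_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
) as dag:
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_ingestion_to_s3",
        external_dag_id="init_0_ingestion_to_s3_dag",
        external_task_id=None,      # wait for the whole upstream DAG, not a single task
        poke_interval=60,
    )
    run_spark_steps = EmptyOperator(task_id="run_spark_steps")  # stands in for the EMR tasks

    wait_for_ingestion >> run_spark_steps
```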
The following screenshot shows a successful run of the first DAG.
After all the DAG operations, we can now move to Metabase to visualise the data.
Again we will install and run Metabase in a Docker container.
docker run -d -p 3033:3000 --name metabase metabase/metabase

The first time it runs, the above command downloads the latest Metabase Docker image before exposing the application on port 3033.
Once the above command finishes its execution, Metabase should be available at http://localhost:3033.
We can now connect our Redshift database to this platform and visualise the data in multiple charts.
The following screenshot displays part of our final dashboard, which clearly shows some useful insights about bicycle rides across different dimensions.


