Published Power BI Link: https://app.powerbi.com/groups/me/reports/49e81a4e-d445-4f48-b539-e16644e5f70c/8114ca2b8762d315d2b3?experience=power-bi
Production-grade ETL pipeline for processing medical/vital signs data from S3 to AWS RDS PostgreSQL.
Install dependencies:

```bash
pip install python-dotenv
pip install boto3
pip install pandas
pip install sqlalchemy
pip install psycopg2-binary
```

Or install from `requirements.txt`:

```bash
pip install -r requirements.txt
```
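For reference, a `requirements.txt` covering the packages above might look like this (unpinned here; pin versions as appropriate for your environment):

```text
python-dotenv
boto3
pandas
sqlalchemy
psycopg2-binary
```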
1. Install Dependencies (see pip commands above)

2. Configure Environment Variables
   - Copy `.env.template` to `.env` and fill in your AWS and RDS configuration:

     ```bash
     cp .env.template .env
     ```

   - Edit `.env` with your actual values:
     - `RDS_SECRET_NAME`: your AWS Secrets Manager secret name
     - `S3_BUCKET_NAME`: your S3 bucket name
     - `RDS_WRITER_ENDPOINT`: your RDS cluster writer endpoint
     - `AWS_REGION`: your AWS region
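After editing, the `.env` file might look like the following (all values are illustrative placeholders, not real configuration):

```bash
RDS_SECRET_NAME=my-rds-secret
S3_BUCKET_NAME=my-health-data-bucket
RDS_WRITER_ENDPOINT=my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com
AWS_REGION=us-east-1
```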
3. AWS Credentials
   - Configure AWS credentials via the AWS CLI: `aws configure`
   - Or set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in `.env`
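Once credentials are in place, the pipeline needs the database username and password from Secrets Manager to open a SQLAlchemy connection. A minimal sketch, assuming the secret is stored as a JSON string with `username` and `password` keys (the actual `etl_process.py` may structure this differently):

```python
import json


def fetch_db_credentials(secret_name: str, region: str) -> dict:
    """Fetch RDS credentials from AWS Secrets Manager.

    Assumes the secret value is a JSON string with 'username' and
    'password' keys; adjust the keys to match your secret's layout.
    """
    import boto3  # imported here so the URL helper below works without boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def build_db_url(user: str, password: str, host: str,
                 port: int = 5432, database: str = "postgres") -> str:
    """Build a SQLAlchemy URL for PostgreSQL via the psycopg2 driver."""
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Example wiring (requires valid AWS credentials and network access;
# the secret name and endpoint below are placeholders):
# creds = fetch_db_credentials("my-rds-secret", "us-east-1")
# engine = sqlalchemy.create_engine(
#     build_db_url(creds["username"], creds["password"], "my-writer-endpoint"))
```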
Run the ETL pipeline:

```bash
python etl_process.py
```

- Extract: Downloads `messy_health_data.csv` from S3
- Transform:
  - Removes duplicate rows
  - Fills missing Heart Rate and Oxygen Sat (SpO2) values with the column median
  - Removes sensor-error outliers (Heart Rate > 200 or < 30)
  - Converts the Timestamp column to proper datetime format
- Load: Uploads cleaned data to the RDS table `clinical_vitals`
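The Transform step above can be sketched in pandas roughly as follows. The column names (`Heart Rate`, `Oxygen Sat (SpO2)`, `Timestamp`) are taken from the bullets above; the real `etl_process.py` may differ in detail:

```python
import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Clean raw vitals data following the steps described above."""
    # Remove exact duplicate rows.
    df = df.drop_duplicates()

    # Fill missing Heart Rate and SpO2 readings with each column's median.
    for col in ["Heart Rate", "Oxygen Sat (SpO2)"]:
        df[col] = df[col].fillna(df[col].median())

    # Drop sensor-error outliers: keep only heart rates in [30, 200].
    df = df[(df["Heart Rate"] >= 30) & (df["Heart Rate"] <= 200)].copy()

    # Parse timestamps into proper datetimes; unparseable values become NaT.
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
    return df.reset_index(drop=True)
```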
Prerequisites:

- Python 3.8+
- AWS account with S3, Secrets Manager, and RDS access
- PostgreSQL database on AWS RDS