
HDFS Cloud Starter

Deploy a production-ready Hadoop HDFS cluster on GCP and AWS using Terraform and Ansible.


Features

  • Automated GCP infrastructure (VPC, firewall, VMs)
  • One-command Hadoop/HDFS installation and configuration
  • Secure SSH access (restricted to your IP)
  • Easy scaling and cleanup

Prerequisites

  • Terraform
  • Ansible
  • Google Cloud SDK (the gcloud CLI)
  • A GCP project with billing enabled
  • An SSH key pair (the public key path is set via ssh_public_key_file)

Quick Start

1. Authenticate with GCP

gcloud auth login
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
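
To confirm that credentials and the active project are set before Terraform runs, check the current configuration:

# Show the authenticated accounts (the active one is marked)
gcloud auth list

# Print the project the google provider will pick up
gcloud config get-value project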

2. Configure Deployment

Edit providers/gcp/terraform/terraform.tfvars:

gcp_project_id      = "your-gcp-project-id"
gcp_region          = "us-central1"
gcp_zone            = "us-central1-a"
instance_type       = "e2-medium"
image               = "ubuntu-os-cloud/ubuntu-2204-lts"
worker_count        = 2
allow_ssh_cidr      = "YOUR_IP/32"  # get your IP: curl -4 ifconfig.me
ssh_public_key_file = "~/.ssh/id_rsa.pub"
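
To avoid looking up your address by hand, a one-liner can print a ready-to-paste allow_ssh_cidr line (assuming curl is installed and ifconfig.me is reachable):

# Print your public IPv4 address as a /32 CIDR for terraform.tfvars
echo "allow_ssh_cidr = \"$(curl -4 -s ifconfig.me)/32\""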

3. Deploy the Cluster

chmod +x hdfs-deploy.sh
./hdfs-deploy.sh deploy
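
The script wraps the Terraform and Ansible stages. If you prefer to run them by hand, or the script fails partway, the manual equivalent is roughly the following; the exact flags live in hdfs-deploy.sh, so treat this as a sketch:

# 1. Provision the VPC, firewall rules, and VMs
terraform -chdir=providers/gcp/terraform init
terraform -chdir=providers/gcp/terraform apply

# 2. Turn the Terraform outputs into an Ansible inventory
./generate-inventory.sh

# 3. Install and configure Hadoop/HDFS on every node
ansible-playbook -i ansible/inventory.yml ansible/hdfs-cluster.yml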

4. Access the Cluster

  • NameNode Web UI: http://MASTER_IP:9870 (retrieve MASTER_IP as shown below)
  • Secondary NameNode UI: http://MASTER_IP:9868
  • SSH to Master: ssh ubuntu@MASTER_IP
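
Assuming the Terraform configuration exposes the master's external address as an output (the name master_public_ip below is a guess; check outputs.tf for the real one), you can retrieve MASTER_IP with:

# List all outputs, or query one by name (output name is an assumption)
terraform -chdir=providers/gcp/terraform output
terraform -chdir=providers/gcp/terraform output -raw master_public_ip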

5. Test HDFS

ssh ubuntu@MASTER_IP
sudo su - hadoop                   # switch to the hadoop service user
hdfs dfs -ls /                     # list the HDFS root
hdfs dfs -mkdir /test              # create a test directory
hdfs dfs -put /etc/hosts /test/    # upload a local file into HDFS
hdfs dfsadmin -report              # show DataNode health and capacity
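
To verify the write actually round-trips, read the file back and then clean up:

# Read the uploaded file back out of HDFS
hdfs dfs -cat /test/hosts

# Remove the test directory
hdfs dfs -rm -r /test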

6. Destroy Resources

./hdfs-deploy.sh destroy
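
Teardown can leave resources behind if it is interrupted, so confirm nothing is still running (and billing):

# Should list no instances for this project once teardown completes
gcloud compute instances list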

Project Structure

hdfs-cloud-starter/
├── hdfs-deploy.sh
├── generate-inventory.sh
├── providers/
│   └── gcp/
│       └── terraform/
│           ├── main.tf
│           ├── variables.tf
│           ├── outputs.tf
│           └── terraform.tfvars
├── ansible/
│   ├── hdfs-cluster.yml
│   ├── inventory.yml
│   └── templates/
│       ├── core-site.xml.j2
│       ├── hdfs-site.xml.j2
│       └── workers.j2
└── examples/
    └── word-count/

Scaling & Customization

  • More workers: Increase worker_count in terraform.tfvars and re-deploy (see the example after this list)
  • Instance type: Change instance_type in terraform.tfvars
  • Hadoop version/config: Edit ansible/hdfs-cluster.yml and the templates under ansible/templates/
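
For example, to grow from two workers to four (assuming hdfs-deploy.sh deploy re-applies Terraform and re-runs the playbook idempotently, which is how the script appears intended to be used):

# Bump the worker count, then re-deploy (GNU sed; on macOS use sed -i '')
sed -i 's/^worker_count.*/worker_count        = 4/' providers/gcp/terraform/terraform.tfvars
./hdfs-deploy.sh deploy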

Troubleshooting

  • SSH issues: Your public IP may have changed; update allow_ssh_cidr in terraform.tfvars and re-deploy
  • Instance status: gcloud compute instances list
  • Service status: SSH to the master, run jps, and check the Hadoop logs (see the sketch after this list)
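
A minimal health check on the master, assuming Hadoop runs as the hadoop user and logs land in the default $HADOOP_HOME/logs directory (both are assumptions; adjust to match ansible/hdfs-cluster.yml):

ssh ubuntu@MASTER_IP
sudo su - hadoop

# NameNode and SecondaryNameNode should appear in the JVM process list
jps

# Tail the NameNode log for errors (log path is an assumption)
tail -n 50 $HADOOP_HOME/logs/hadoop-hadoop-namenode-*.log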

License

MIT License
