Add Deployment guide #2
base: master
Conversation
Summary of Changes

Hello @echo-chamber0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new, extensive deployment guide for the Data Commons Accelerator, specifically tailored for deployment through the Google Cloud Marketplace into a Google Kubernetes Engine (GKE) cluster. The guide aims to provide users with clear, step-by-step instructions and best practices for setting up, configuring, and managing their Data Commons instance, ensuring a smooth and secure deployment experience from initial setup to post-deployment operations and cleanup.
Code Review
This pull request adds a comprehensive deployment guide for using the Data Commons Accelerator on GCP Marketplace. The guide is well-structured and detailed, covering the entire process from prerequisites to deletion. I've provided some feedback to improve clarity, correctness, and adherence to best practices. My comments include fixing a broken link in the table of contents, correcting potentially misleading information about data loss and downtime, improving the robustness of gcloud commands, and fixing a minor formatting issue in a markdown table. Overall, this is an excellent addition to the project's documentation.
| - **Field name**: `cloudsql_ha_enabled`
| - **What it means**: Enables automatic database replication to a different availability zone
| - **How it works**:
|   - **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost
The statement "If zone fails, data is lost" for a non-HA CloudSQL instance is misleading. Standard Cloud SQL instances have point-in-time recovery and automated backups enabled by default. In a zonal failure, the instance can be restored from a backup in another zone. While this involves downtime and potential loss of data since the last backup, the data is not permanently lost. Please clarify this to provide a more accurate picture of the failure scenario.
Suggested change:
- - **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost
+ - **Disabled (default)**: Single database instance in one zone. If a zone fails, the instance is unavailable and must be restored from backup, which can cause downtime and data loss.
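For context, the restore path for a non-HA instance can be sketched with two gcloud commands; `INSTANCE_NAME` and `BACKUP_ID` are placeholders, not names from the guide:

```bash
# List automated backups for the instance (INSTANCE_NAME is a placeholder)
gcloud sql backups list --instance=INSTANCE_NAME

# Restore the chosen backup onto the instance; this overwrites current data
gcloud sql backups restore BACKUP_ID --restore-instance=INSTANCE_NAME
```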
docs/DEPLOYMENT_GUIDE.md (outdated)
| 4. [Deployment via GCP Marketplace](#deployment-via-gcp-marketplace)
| 5. [Accessing Your Deployment](#accessing-your-deployment)
| 6. [Using Data Commons Accelerator](#using-data-commons-accelerator)
| 7. [Managing Your Deployment](#managing-your-deployment)
| gcloud compute addresses list --global \
|   --filter="purpose=VPC_PEERING AND network~YOUR_VPC_NAME" \
|   --format="table(name,address,prefixLength,network)"
The `gcloud` command uses the `~` operator to filter by network name. This operator performs a regular expression match. For matching a specific VPC name, it is safer and clearer to use the exact match operator `=`. This prevents unexpected behavior if the VPC name contains characters with special meaning in regular expressions.
Suggested change:
- gcloud compute addresses list --global \
-   --filter="purpose=VPC_PEERING AND network~YOUR_VPC_NAME" \
-   --format="table(name,address,prefixLength,network)"
+ gcloud compute addresses list --global \
+   --filter="purpose=VPC_PEERING AND network=YOUR_VPC_NAME" \
+   --format="table(name,address,prefixLength,network)"
This suggestion sounds reasonable.
| - **What it means**: Enables automatic database replication to a different availability zone
| - **How it works**:
|   - **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost
|   - **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, automatically switches to replica with zero downtime
The claim of "zero downtime" for High Availability failover is an overstatement. While Cloud SQL HA failover is very fast, it is not instantaneous. There is typically a brief period of unavailability, usually lasting a few seconds to a minute. It's more accurate to use terms like "minimal downtime" or "near-zero downtime" to set correct expectations.
Suggested change:
- - **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, automatically switches to replica with zero downtime
+ - **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, it automatically switches to the replica with minimal downtime.
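If it helps set expectations, the failover behavior can be observed directly; a minimal sketch, assuming a REGIONAL (HA) instance named `INSTANCE_NAME` (placeholder):

```bash
# Check whether HA is enabled (REGIONAL) or not (ZONAL)
gcloud sql instances describe INSTANCE_NAME \
  --format="value(settings.availabilityType)"

# Trigger a manual failover to the standby; expect a brief interruption
gcloud sql instances failover INSTANCE_NAME
```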
| - **Field name**: `gcs_bucket_name`
| - **What it means**: Name of the Cloud Storage bucket (like a drive on the cloud)
| - **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods
The guide mentions that periods are allowed in GCS bucket names. While true, it's a GCP best practice to avoid them as they can cause issues with virtual-hosted-style URLs and SSL certificate validation. It would be beneficial to add a note of caution.
Suggested change:
- - **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods
+ - **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods (use of periods is not recommended).
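A hedged example of a compliant, period-free name (the project prefix and location are hypothetical):

```bash
# A project-prefixed, hyphen-only name is globally unique and avoids the
# SSL/virtual-hosted-style URL issues that periods can cause
gcloud storage buckets create gs://my-project-datacommons-data \
  --location=us-central1
```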
docs/DEPLOYMENT_GUIDE.md (outdated)
| | Tier | CPU | Memory | Best For |
| |------|-----|--------|----------|
| | `small` | 1.0 | 2 GB | Light workloads, <10 concurrent users |
| # Keep this terminal open; the port-forward will run in the foreground
| # In another terminal, test the application:
| curl http://localhost:8080
For verifying the application is running, it's more reliable to query a dedicated health check endpoint like /healthz (which is used later in the troubleshooting section) instead of the root URL. A health check endpoint provides a more accurate status of the application's health.
Suggested change:
- curl http://localhost:8080
+ curl http://localhost:8080/healthz
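Putting the two terminals together, the check might look like this; the namespace and service names are hypothetical and should be replaced with the deployment's actual values:

```bash
# Terminal 1: forward local port 8080 to the application service
# (namespace and service names are hypothetical)
kubectl -n datacommons port-forward svc/datacommons-web 8080:8080

# Terminal 2: query the health endpoint; -f makes curl fail on HTTP errors
curl -fsS http://localhost:8080/healthz
```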
| **Solution:**
| 1. Verify your GKE cluster's VPC: `gcloud container clusters describe CLUSTER --format="value(network)"`
| 2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network~YOUR_VPC"`
As noted above, the `~` operator performs a regular expression match; for matching a specific VPC name, the exact match operator `=` is safer and clearer.
Suggested change:
- 2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network~YOUR_VPC"`
+ 2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network=YOUR_VPC"`
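The two steps could also be chained so the exact network value returned for the cluster is reused in the filter; `CLUSTER` and `LOCATION` are placeholders:

```bash
# Capture the cluster's VPC, then list PSA ranges in exactly that network
VPC=$(gcloud container clusters describe CLUSTER --location=LOCATION \
  --format="value(network)")
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING AND network=${VPC}" \
  --format="table(name,address,prefixLength,network)"
```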
| ### What is Data Commons Accelerator?
|
| Data Commons Accelerator is a ready-to-deploy instance of the Data Commons platform on Google Kubernetes Engine (GKE). Data Commons is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.
Suggested change:
- Data Commons Accelerator is a ready-to-deploy instance of the Data Commons platform on Google Kubernetes Engine (GKE). Data Commons is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.
+ Data Commons Accelerator is a ready-to-deploy instance of [Custom Data Commons](https://docs.datacommons.org/custom_dc/) on Google Kubernetes Engine (GKE). [Data Commons](https://datacommons.org) is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.
| ### What Problems Does It Solve?
|
| Data Commons Accelerator addresses these common data exploration challenges:
Suggested change:
- Data Commons Accelerator addresses these common data exploration challenges:
+ Data Commons addresses these common data exploration challenges:
| - **Data Commons Accelerator Web Application**: Interactive interface for data exploration and visualization
| - **CloudSQL MySQL Database**: Persistent storage for datasets and metadata (with optional high availability)
| - **Cloud Storage Bucket**: Scalable storage for custom data imports and exports
Suggested change:
- - **Cloud Storage Bucket**: Scalable storage for custom data imports and exports
+ - **Cloud Storage Bucket**: Scalable storage for custom data imports
| - **Data Commons Accelerator Web Application**: Interactive interface for data exploration and visualization
| - **CloudSQL MySQL Database**: Persistent storage for datasets and metadata (with optional high availability)
| - **Cloud Storage Bucket**: Scalable storage for custom data imports and exports
| - **Kubernetes Workload**: Application deployed to your existing GKE cluster with Workload Identity authentication
Do we create a new GKE cluster or reuse an existing cluster?
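The quoted line says "your existing GKE cluster"; for readers unfamiliar with the Workload Identity authentication it mentions, a minimal binding sketch (all project, namespace, and account names are hypothetical) would be:

```bash
# Let the Kubernetes service account impersonate the GCP service account
# (project, namespace, and account names are all hypothetical)
gcloud iam service-accounts add-iam-policy-binding \
  datacommons-app@my-project.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:my-project.svc.id.goog[datacommons/datacommons-sa]"

# Annotate the Kubernetes service account with the GCP service account
kubectl annotate serviceaccount datacommons-sa -n datacommons \
  iam.gke.io/gcp-service-account=datacommons-app@my-project.iam.gserviceaccount.com
```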
| ### Components
|
| The Data Commons Accelerator solution consists of four primary components:
Also:
- GKE data ingestion container
- Redis cache (Memorystore?)
| ## Using Data Commons Accelerator
|
| ### Key Features
I'd remove this section and just point to the doc site
| To get started using Data Commons:
|
| - **Official Tutorials**: https://datacommons.org/tutorials
This link looks broken.
| - **Official Tutorials**: https://datacommons.org/tutorials
| - **API Documentation**: https://docs.datacommons.org/api
| - **Knowledge Graph Explorer**: https://datacommons.org/ (official site for learning about available data)
Suggested change:
- - **Knowledge Graph Explorer**: https://datacommons.org/ (official site for learning about available data)
+ - **Knowledge Graph Explorer**: https://datacommons.org/browser (official site for learning about available data)
| **Troubleshooting Steps:**
|
| **1. Verify Application Pods Are Running**
These steps look the same as the verify-deployment section. Is there a good way to consolidate them?
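If the two sections are consolidated, the shared verification step could be reduced to a short snippet like this (the namespace is hypothetical):

```bash
# Check pod status in the application namespace (namespace is hypothetical)
kubectl get pods -n datacommons

# For a pod that is not Ready, inspect events and recent logs
kubectl describe pod POD_NAME -n datacommons
kubectl logs POD_NAME -n datacommons --tail=50
```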
| These resources are automatically deleted:
|
| - Kubernetes namespace and all pods
Also: the Memorystore instance.
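To double-check the cleanup (including the Memorystore instance), a post-deletion sweep might look like this; `REGION` is a placeholder:

```bash
# Confirm managed resources are gone after teardown (REGION is a placeholder)
gcloud sql instances list
gcloud redis instances list --region=REGION
gcloud storage ls
```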
PR description: Add a deployment guide on how to use GCP Marketplace to deploy Custom Data Commons into GKE.