
Conversation

@echo-chamber0

Add deployment guide: how to use GCP Marketplace to deploy a custom DC into GKE

@gemini-code-assist

Summary of Changes

Hello @echo-chamber0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new, extensive deployment guide for the Data Commons Accelerator, specifically tailored for deployment through the Google Cloud Marketplace into a Google Kubernetes Engine (GKE) cluster. The guide aims to provide users with clear, step-by-step instructions and best practices for setting up, configuring, and managing their Data Commons instance, ensuring a smooth and secure deployment experience from initial setup to post-deployment operations and cleanup.

Highlights

  • New Deployment Guide Added: A comprehensive guide for deploying the Data Commons Accelerator through the Google Cloud Marketplace into a GKE cluster has been added.
  • Detailed Prerequisites: The guide outlines all necessary prerequisites, including GKE cluster version requirements, Workload Identity enablement, Private Service Access configuration, and required IAM permissions.
  • Step-by-Step Deployment: It provides a field-by-field walkthrough of the GCP Marketplace deployment form, covering basic configuration, GKE cluster settings, CloudSQL, Cloud Storage, API, and application configurations.
  • Post-Deployment Management: Sections on accessing the deployed application, verifying its status, key features, troubleshooting common issues, and a detailed guide for deleting the deployment and associated resources are included.

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds a comprehensive deployment guide for using the Data Commons Accelerator on GCP Marketplace. The guide is well-structured and detailed, covering the entire process from prerequisites to deletion. I've provided some feedback to improve clarity, correctness, and adherence to best practices. My comments include fixing a broken link in the table of contents, correcting potentially misleading information about data loss and downtime, improving the robustness of gcloud commands, and fixing a minor formatting issue in a markdown table. Overall, this is an excellent addition to the project's documentation.

- **Field name**: `cloudsql_ha_enabled`
- **What it means**: Enables automatic database replication to a different availability zone
- **How it works**:
- **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost


Severity: high

The statement "If zone fails, data is lost" for a non-HA CloudSQL instance is misleading. Standard Cloud SQL instances have point-in-time recovery and automated backups enabled by default. In a zonal failure, the instance can be restored from a backup in another zone. While this involves downtime and potential loss of data since the last backup, the data is not permanently lost. Please clarify this to provide a more accurate picture of the failure scenario.

Suggested change
- **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost
- **Disabled (default)**: Single database instance in one zone. If a zone fails, the instance is unavailable and must be restored from backup, which can cause downtime and data loss.
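
To make that concrete, the guide could show readers how to confirm their backup posture. A minimal sketch, assuming the Cloud SQL instance created by the deployment is named `INSTANCE_NAME` (a placeholder):

```sh
# Check whether automated backups are enabled on the instance.
gcloud sql instances describe INSTANCE_NAME \
  --format="value(settings.backupConfiguration.enabled)"

# List the backups that would be available to restore from after a zonal failure.
gcloud sql backups list --instance=INSTANCE_NAME
```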

4. [Deployment via GCP Marketplace](#deployment-via-gcp-marketplace)
5. [Accessing Your Deployment](#accessing-your-deployment)
6. [Using Data Commons Accelerator](#using-data-commons-accelerator)
7. [Managing Your Deployment](#managing-your-deployment)


Severity: medium

The Table of Contents includes an entry for "Managing Your Deployment" that links to #managing-your-deployment. However, there is no corresponding section header in the document, which will result in a broken link. Please either add this section or remove the entry from the table of contents.

Comment on lines +196 to +198
gcloud compute addresses list --global \
--filter="purpose=VPC_PEERING AND network~YOUR_VPC_NAME" \
--format="table(name,address,prefixLength,network)"


Severity: medium

The gcloud command uses the ~ operator to filter by network name. This operator performs a regular expression match. For matching a specific VPC name, it is safer and clearer to use the exact match operator =. This prevents unexpected behavior if the VPC name contains characters with special meaning in regular expressions.

Suggested change
gcloud compute addresses list --global \
--filter="purpose=VPC_PEERING AND network~YOUR_VPC_NAME" \
--format="table(name,address,prefixLength,network)"
gcloud compute addresses list --global \
--filter="purpose=VPC_PEERING AND network=YOUR_VPC_NAME" \
--format="table(name,address,prefixLength,network)"
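
To illustrate the difference, here is a hypothetical example with a VPC actually named `my-vpc`; the network names in the filters are placeholders:

```sh
# '~' is a regex substring match: "network~my-vpc" would also match
# networks named "my-vpc-2" or "legacy-my-vpc" (and a '.' in the name
# would act as a regex wildcard).
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING AND network~my-vpc"

# '=' only matches the network named exactly "my-vpc".
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING AND network=my-vpc"
```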


This suggestion sounds reasonable.

- **What it means**: Enables automatic database replication to a different availability zone
- **How it works**:
- **Disabled (default)**: Single database instance in one zone. If zone fails, data is lost
- **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, automatically switches to replica with zero downtime


Severity: medium

The claim of "zero downtime" for High Availability failover is an overstatement. While Cloud SQL HA failover is very fast, it is not instantaneous. There is typically a brief period of unavailability, usually lasting a few seconds to a minute. It's more accurate to use terms like "minimal downtime" or "near-zero downtime" to set correct expectations.

Suggested change
- **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, automatically switches to replica with zero downtime
- **Enabled**: Two instances (primary + replica) in different zones. If one zone fails, it automatically switches to the replica with minimal downtime.
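
If the guide wants to go further, it could mention that HA maps to the instance's availability type, and that a failover can be rehearsed to observe the actual downtime. A sketch, with `INSTANCE_NAME` as a placeholder:

```sh
# Promote an existing instance to regional (HA) availability.
gcloud sql instances patch INSTANCE_NAME --availability-type=REGIONAL

# Optionally trigger a manual failover to measure the brief unavailability.
gcloud sql instances failover INSTANCE_NAME
```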


- **Field name**: `gcs_bucket_name`
- **What it means**: Name of the Cloud Storage bucket (like a drive on the cloud)
- **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods


Severity: medium

The guide mentions that periods are allowed in GCS bucket names. While true, it's a GCP best practice to avoid them as they can cause issues with virtual-hosted-style URLs and SSL certificate validation. It would be beneficial to add a note of caution.

Suggested change
- **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods
- **Format**: Must be globally unique across all GCP projects. Only lowercase letters, numbers, hyphens, periods (use of periods is not recommended).
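
As a concrete illustration, creating the bucket up front validates the name immediately, since naming violations are rejected at creation time. The bucket name and location below are placeholders:

```sh
# A hyphen-only name avoids the virtual-hosted-style URL and SSL
# certificate issues that periods can cause.
gcloud storage buckets create gs://my-unique-dc-bucket --location=us-central1
```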


| Tier | CPU | Memory | Best For |
|------|-----|--------|--------|
| `small` | 1.0 | 2 GB| Light workloads, <10 concurrent users |


Severity: medium

This table row is missing a space before the final pipe (`|`), which will cause it to render incorrectly in some markdown parsers.

Suggested change
| `small` | 1.0 | 2 GB| Light workloads, <10 concurrent users |
| `small` | 1.0 | 2 GB | Light workloads, <10 concurrent users |


# Keep this terminal open; the port-forward will run in the foreground
# In another terminal, test the application:
curl http://localhost:8080


Severity: medium

For verifying the application is running, it's more reliable to query a dedicated health check endpoint like /healthz (which is used later in the troubleshooting section) instead of the root URL. A health check endpoint provides a more accurate status of the application's health.

Suggested change
curl http://localhost:8080
curl http://localhost:8080/healthz
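
Building on that, the verification step could show both terminals end to end. A sketch, where `NAMESPACE` and `SERVICE` are placeholders for the deployment's actual resources:

```sh
# Terminal 1: forward local port 8080 to the application service.
kubectl port-forward -n NAMESPACE svc/SERVICE 8080:8080

# Terminal 2: query the health endpoint; -f makes curl fail on HTTP errors.
curl -f http://localhost:8080/healthz && echo "application is healthy"
```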


**Solution:**
1. Verify your GKE cluster's VPC: `gcloud container clusters describe CLUSTER --format="value(network)"`
2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network~YOUR_VPC"`


Severity: medium

The gcloud command uses the ~ operator to filter by network name. This operator performs a regular expression match. For matching a specific VPC name, it is safer and clearer to use the exact match operator =. This prevents unexpected behavior if the VPC name contains characters with special meaning in regular expressions.

Suggested change
2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network~YOUR_VPC"`
2. Find PSA ranges in that VPC: `gcloud compute addresses list --global --filter="purpose=VPC_PEERING AND network=YOUR_VPC"`
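
The two troubleshooting steps could also be chained so the VPC name is never copied by hand. A sketch, with `CLUSTER` and `LOCATION` as placeholders:

```sh
# Resolve the cluster's VPC, then list the PSA ranges in that VPC.
VPC=$(gcloud container clusters describe CLUSTER --location=LOCATION \
  --format="value(network)")
gcloud compute addresses list --global \
  --filter="purpose=VPC_PEERING AND network=${VPC}" \
  --format="table(name,address,prefixLength)"
```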

@dwnoble self-requested a review on January 15, 2026 at 17:54

### What is Data Commons Accelerator?

Data Commons Accelerator is a ready-to-deploy instance of the Data Commons platform on Google Kubernetes Engine (GKE). Data Commons is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.

Suggested change
Data Commons Accelerator is a ready-to-deploy instance of the Data Commons platform on Google Kubernetes Engine (GKE). Data Commons is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.
Data Commons Accelerator is a ready-to-deploy instance of [Custom Data Commons](https://docs.datacommons.org/custom_dc/) on Google Kubernetes Engine (GKE). [Data Commons](https://datacommons.org) is an open knowledge repository providing unified access to public datasets and statistics, enabling your organization to explore data without manually aggregating from multiple sources.


### What Problems Does It Solve?

Data Commons Accelerator addresses these common data exploration challenges:

Suggested change
Data Commons Accelerator addresses these common data exploration challenges:
Data Commons addresses these common data exploration challenges:


- **Data Commons Accelerator Web Application**: Interactive interface for data exploration and visualization
- **CloudSQL MySQL Database**: Persistent storage for datasets and metadata (with optional high availability)
- **Cloud Storage Bucket**: Scalable storage for custom data imports and exports

Suggested change
- **Cloud Storage Bucket**: Scalable storage for custom data imports and exports
- **Cloud Storage Bucket**: Scalable storage for custom data imports

- **Data Commons Accelerator Web Application**: Interactive interface for data exploration and visualization
- **CloudSQL MySQL Database**: Persistent storage for datasets and metadata (with optional high availability)
- **Cloud Storage Bucket**: Scalable storage for custom data imports and exports
- **Kubernetes Workload**: Application deployed to your existing GKE cluster with Workload Identity authentication
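
For a quick post-deployment spot check of these components, something like the following sketch could be added (`NAMESPACE` is a placeholder for the application namespace):

```sh
# Kubernetes workload: pods and the service fronting the application.
kubectl get pods,svc -n NAMESPACE

# CloudSQL instance and Cloud Storage bucket in the active project.
gcloud sql instances list
gcloud storage ls
```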

Do we create a new GKE cluster or reuse an existing cluster?


### Components

The Data Commons Accelerator solution consists of four primary components:

Also:

  • GKE data ingestion container
  • Redis cache (Memorystore?)


## Using Data Commons Accelerator

### Key Features

I'd remove this section and just point to the doc site


To get started using Data Commons:

- **Official Tutorials**: https://datacommons.org/tutorials

This link looks broken.


- **Official Tutorials**: https://datacommons.org/tutorials
- **API Documentation**: https://docs.datacommons.org/api
- **Knowledge Graph Explorer**: https://datacommons.org/ (official site for learning about available data)

Suggested change
- **Knowledge Graph Explorer**: https://datacommons.org/ (official site for learning about available data)
- **Knowledge Graph Explorer**: https://datacommons.org/browser (official site for learning about available data)
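
To give readers a first taste of the linked API documentation, a sketch of a REST call might help. Parameter names here follow the v2 observation endpoint as documented, but should be verified against docs.datacommons.org/api; `YOUR_API_KEY` is a placeholder:

```sh
# Fetch the latest population count for the USA from the public API.
# curl -G turns the url-encoded data into a GET query string.
curl -G "https://api.datacommons.org/v2/observation" \
  --data-urlencode "key=YOUR_API_KEY" \
  --data-urlencode "date=LATEST" \
  --data-urlencode "variable.dcids=Count_Person" \
  --data-urlencode "entity.dcids=country/USA" \
  --data-urlencode "select=date" --data-urlencode "select=entity" \
  --data-urlencode "select=variable" --data-urlencode "select=value"
```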


**Troubleshooting Steps:**

**1. Verify Application Pods Are Running**
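
Presumably this step expands to something like the following sketch (`NAMESPACE` and `POD_NAME` are placeholders):

```sh
# List pods and confirm STATUS is Running and READY is n/n.
kubectl get pods -n NAMESPACE

# If a pod is not healthy, inspect its events and recent logs.
kubectl describe pod POD_NAME -n NAMESPACE
kubectl logs POD_NAME -n NAMESPACE --tail=50
```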

These steps look the same as the verify deployment section. Is there a good way to consolidate them?


These resources are automatically deleted:

- Kubernetes namespace and all pods

Memorystore instance
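
If the deployment does provision a Memorystore (Redis) instance as suggested here, the deletion section could confirm cleanup with a check like this (`REGION` is a placeholder):

```sh
# An empty result means no Memorystore (Redis) instances remain.
gcloud redis instances list --region=REGION
```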
