Pull request overview
This PR adds a Jupyter notebook tool intended to help users estimate the approximate GCP cost of running Jupyter notebooks inside Verily Workbench (compute + data/storage + optional workspace-resource discovery via wb CLI).
Changes:
- Introduces a new cost_estimator.ipynb notebook with ipywidgets-driven inputs and cost calculations.
- Adds logic to estimate storage/query costs for GCS and BigQuery resources.
- Adds workspace resource discovery and sizing helpers via wb CLI commands.
| "import json\n", | ||
| "import pandas as pd\n", | ||
| "from IPython.display import display, HTML\n", | ||
| "\n", | ||
| "def run_wb_command(command):\n", | ||
| " \"\"\"Run a wb CLI command and return the result\"\"\"\n", | ||
| " result = subprocess.run(command, shell=True, capture_output=True, text=True)\n", | ||
| " if result.returncode == 0:\n", | ||
| " return result.stdout.strip()\n", | ||
| " else:\n", | ||
| " raise Exception(f\"Command failed: {command}\\nError: {result.stderr}\")\n", | ||
| "\n", |
run_wb_command() uses subprocess.run(..., shell=True) and later interpolates values into commands (e.g., resource names/datasets). This makes command injection possible if any interpolated value contains shell metacharacters. Prefer shell=False with an argv list and pass arguments as separate list items.
| "import json\n", | |
| "import pandas as pd\n", | |
| "from IPython.display import display, HTML\n", | |
| "\n", | |
| "def run_wb_command(command):\n", | |
| " \"\"\"Run a wb CLI command and return the result\"\"\"\n", | |
| " result = subprocess.run(command, shell=True, capture_output=True, text=True)\n", | |
| " if result.returncode == 0:\n", | |
| " return result.stdout.strip()\n", | |
| " else:\n", | |
| " raise Exception(f\"Command failed: {command}\\nError: {result.stderr}\")\n", | |
| "\n", | |
| "import shlex\n", | |
| "import json\n", | |
| "import pandas as pd\n", | |
| "from IPython.display import display, HTML\n", | |
| "\n", | |
| "def run_wb_command(command):\n", | |
| " \"\"\"Run a wb CLI command and return the result.\n", | |
| "\n", | |
| " Accepts either a string command or an iterable of arguments.\n", | |
| " \"\"\"\n", | |
| " if isinstance(command, str):\n", | |
| " cmd = shlex.split(command)\n", | |
| " else:\n", | |
| " cmd = list(command)\n", | |
| "\n", | |
| " result = subprocess.run(cmd, capture_output=True, text=True)\n", | |
| " if result.returncode == 0:\n", | |
| " return result.stdout.strip()\n", | |
| " else:\n", | |
| " raise Exception(f\"Command failed: {command}\\nError: {result.stderr}\")\n", | |
| "\n", |
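A minimal sketch of why an argv list closes the injection hole. The helper name run_argv is hypothetical, and the Python interpreter stands in for the wb CLI purely so the example runs anywhere; no wb subcommands or flags are assumed.

```python
import subprocess
import sys


def run_argv(argv):
    """Run a command given as a list of arguments, without a shell."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Command failed: {argv}\nError: {result.stderr}")
    return result.stdout.strip()


# A value that would be dangerous if interpolated into a shell command string.
untrusted_name = "my-bucket; echo INJECTED"

# Stand-in for the wb CLI (assumption): a child Python process that prints
# its first argument exactly as received.
echo_argv = [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted_name]

# The whole value arrives as one literal argument; nothing is interpreted
# by a shell, so the "; echo INJECTED" part is never executed.
echoed = run_argv(echo_argv)
print(echoed)
```

Note that calling shlex.split on a string that already contains interpolated values still splits a value with spaces into several arguments, so it is probably safer for callers to build the list themselves and keep the string form only for fixed commands.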
| " resource_type = resources_df.iloc[i-1]['Type']\n", | ||
| " stewardship = resources_df.iloc[i-1]['Stewardship']\n", | ||
| " print(f\" {i}. {name} ({resource_type}, {stewardship})\")\n", |
This loop reuses the name resource_type, which earlier refers to the ipywidgets dropdown. Shadowing it with a string from the dataframe can break subsequent cells that expect resource_type.value. Rename the loop variable (e.g., resource_type_str) to avoid clobbering the widget.
| " resource_type = resources_df.iloc[i-1]['Type']\n", | |
| " stewardship = resources_df.iloc[i-1]['Stewardship']\n", | |
| " print(f\" {i}. {name} ({resource_type}, {stewardship})\")\n", | |
| " resource_type_str = resources_df.iloc[i-1]['Type']\n", | |
| " stewardship = resources_df.iloc[i-1]['Stewardship']\n", | |
| " print(f\" {i}. {name} ({resource_type_str}, {stewardship})\")\n", |
| " elif resource_type == 'mixed':\n", | ||
| " # Combination of storage and BigQuery\n", | ||
| " gcs_storage = storage_price_per_gb_hour * data_gb * runtime_hrs\n", | ||
| " bq_queries = (processed_gb / 1000) * bq_query_price_per_tb * queries\n", | ||
| " storage_cost = gcs_storage + bq_queries\n", |
In the mixed branch, query costs are added into storage_cost and query_cost remains 0, but downstream code expects query costs in query_cost (it prints them only if query_cost > 0). Return the mixed query component via query_cost and keep storage in storage_cost so totals/breakdowns are correct.
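One way the mixed case could keep the two components apart is sketched here. The function name estimate_mixed_cost and the prices in the usage lines are hypothetical placeholders, not current GCP pricing; the GB-to-TB divisor of 1000 is kept as it appears in the notebook.

```python
def estimate_mixed_cost(data_gb, runtime_hrs, processed_gb, queries,
                        storage_price_per_gb_hour, bq_query_price_per_tb):
    """Return (storage_cost, query_cost) for a mixed GCS + BigQuery workload."""
    # Storage component: GCS data held for the duration of the run.
    storage_cost = storage_price_per_gb_hour * data_gb * runtime_hrs
    # Query component: BigQuery data scanned per query, billed per TB.
    query_cost = (processed_gb / 1000) * bq_query_price_per_tb * queries
    return storage_cost, query_cost


# Placeholder inputs for illustration only.
storage_cost, query_cost = estimate_mixed_cost(
    data_gb=100, runtime_hrs=10, processed_gb=500, queries=4,
    storage_price_per_gb_hour=0.001, bq_query_price_per_tb=5.0,
)
print(f"storage: ${storage_cost:.4f}, queries: ${query_cost:.4f}")
```

Returning both values lets the downstream `query_cost > 0` check and the per-component breakdown work without special-casing mixed resources.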
| "notebook_file = widgets.Text(\n", | ||
| " value='',\n", | ||
| " placeholder='Enter notebook filename (e.g., analysis.ipynb)',\n", | ||
| " description='Notebook:',\n", | ||
| " disabled=False\n", |
The notebook_file widget is collected from the user but never read anywhere in the notebook (it’s only added to the UI). Either use it (e.g., to validate the notebook exists / extract metadata) or remove it to avoid confusing users.
| " \n", | ||
| " except Exception as e:\n", | ||
| " print(f\" ⚠️ Could not determine size: {str(e)}\")\n", | ||
| " \n", |
total_estimated_cost is incremented here but is never initialized anywhere in the notebook, so this will raise NameError. Initialize it before the loop (and decide whether it’s meant to start at 0 or include previously computed totals).
| " \n", | |
| " \n", | |
| " # Initialize total_estimated_cost before accumulating compute costs\n", | |
| " total_estimated_cost = 0.0\n", |
| " print(f\" 💻 Compute cost ({runtime_hours.value}h): ${compute_cost:.2f}\")\n", | ||
| " print(f\" 💾 Storage cost ({total_storage_gb:.1f}GB): ${total_estimated_cost - compute_cost - total_query_costs:.4f}\")\n", | ||
| " if total_query_costs > 0:\n", | ||
| " print(f\" 🔍 Query costs: ${total_query_costs:.4f}\")\n", |
This print statement references total_storage_gb and total_query_costs, but neither variable is defined in the notebook scope at this point. Aggregate these totals explicitly (e.g., sum per-resource sizes/costs) before printing the final breakdown.
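A rough sketch of the explicit aggregation, assuming each discovered resource yields a record with its size and its storage and query costs. The per_resource list, its field names, and all numbers are hypothetical; the notebook may hold these values in a different structure.

```python
# Hypothetical per-resource results collected by the discovery loop.
per_resource = [
    {"name": "bucket-a", "size_gb": 120.0, "storage_cost": 0.033, "query_cost": 0.0},
    {"name": "dataset-b", "size_gb": 40.0, "storage_cost": 0.011, "query_cost": 2.5},
]
compute_cost = 1.58  # placeholder: hourly price times runtime hours

# Define every total before the final breakdown is printed.
total_storage_gb = sum(r["size_gb"] for r in per_resource)
total_storage_cost = sum(r["storage_cost"] for r in per_resource)
total_query_costs = sum(r["query_cost"] for r in per_resource)
total_estimated_cost = compute_cost + total_storage_cost + total_query_costs

print(f"Compute cost: ${compute_cost:.2f}")
print(f"Storage cost ({total_storage_gb:.1f}GB): ${total_storage_cost:.4f}")
if total_query_costs > 0:
    print(f"Query costs: ${total_query_costs:.4f}")
print(f"Total: ${total_estimated_cost:.2f}")
```

Tracking total_storage_cost directly also avoids deriving the storage figure by subtraction, which silently absorbs any error in the other totals.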
| "special_resources = widgets.Text(\n", | ||
| " value='',\n", | ||
| " placeholder='e.g., GPU, highmem',\n", | ||
| " description='Special Resources:',\n", | ||
| " disabled=False\n", |
The special_resources widget is never used in any estimation logic. If the estimator is meant to account for GPUs/highmem, wire this input into the compute pricing/model selection; otherwise remove it to avoid misleading output.
| "default_machine_type = 'n1-standard-4'\n", | ||
| "default_vcpu = 4\n", | ||
| "default_ram_gb = 15\n", | ||
| "compute_price_per_hour = 0.158 # USD/hr (update if pricing changes)\n", |
This code contains non-ASCII whitespace characters (e.g., NBSP) — visible here after the numeric literal — which can cause Python SyntaxError: invalid non-printable character or indentation errors when executed. Please replace these with normal spaces (U+0020) throughout the notebook and re-save.
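The cleanup could be scripted rather than done by hand. The sketch below assumes the standard notebook JSON layout (a top-level "cells" list whose code cells carry a "source" field); the function name strip_nbsp and the demo notebook are hypothetical.

```python
NBSP = "\u00a0"


def strip_nbsp(notebook):
    """Replace NBSP with a regular space in code cells; return the count replaced."""
    replaced = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # Notebook JSON stores source either as one string or as a list of lines.
        if isinstance(source, str):
            replaced += source.count(NBSP)
            cell["source"] = source.replace(NBSP, " ")
        else:
            replaced += sum(line.count(NBSP) for line in source)
            cell["source"] = [line.replace(NBSP, " ") for line in source]
    return replaced


# In-memory demo notebook with one NBSP after the numeric literal.
demo = {
    "cells": [
        {"cell_type": "code",
         "source": ["compute_price_per_hour = 0.158\u00a0 # USD/hr\n"]},
    ]
}
count = strip_nbsp(demo)
print(f"Replaced {count} NBSP character(s)")
```

For a real file, the notebook would be loaded with json.load, passed through strip_nbsp, and written back with json.dump. This blanket replacement also touches NBSPs inside string literals, so the diff is worth a quick look before committing.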
| " if resource_type in ['gcs_bucket', 'gcs_object', 'mixed']:\n", | ||
| " # Cloud Storage costs for data + output\n", | ||
| " total_storage_gb = data_gb + output_gb\n", | ||
| " storage_cost = storage_price_per_gb_hour * total_storage_gb * runtime_hrs\n", | ||
| " \n", |
mixed is included in this Cloud Storage branch, so a later elif resource_type == 'mixed' branch (below) will never run. Handle mixed in its own branch (or remove the dead branch) so mixed-resource estimates are computed as intended.
| "- Resource type: {resource_explanation.get(resource_type.value, resource_type.value)}\n", | ||
| "- Storage price: ${storage_price_per_gb_month}/GB/month (Cloud Storage Standard)\"\"\"\n", | ||
| "\n", | ||
| "if resource_type.value in ['bq_dataset', 'bq_table']:\n", |
The generated explanation always includes the Cloud Storage Standard price, even when the selected resource type is BigQuery. This makes the displayed “Storage price” misleading for bq_dataset/bq_table; adjust the text to show BigQuery storage pricing when applicable (and/or omit GCS pricing for BQ-only flows).
| "- Resource type: {resource_explanation.get(resource_type.value, resource_type.value)}\n", | |
| "- Storage price: ${storage_price_per_gb_month}/GB/month (Cloud Storage Standard)\"\"\"\n", | |
| "\n", | |
| "if resource_type.value in ['bq_dataset', 'bq_table']:\n", | |
| "- Resource type: {resource_explanation.get(resource_type.value, resource_type.value)}\"\"\"\n", | |
| "\n", | |
| "if resource_type.value in ['gcs_bucket', 'gcs_object', 'mixed']:\n", | |
| " explanation += f\"\"\"\n", | |
| "- Storage price: ${storage_price_per_gb_month}/GB/month (Cloud Storage Standard)\"\"\"\n", | |
| "\n", | |
| "if resource_type.value in ['bq_dataset', 'bq_table', 'mixed']:\n", |
Create a tool that enables users to estimate the cost of running their own Jupyter notebooks in their Workbench workspaces.