
Epic 0.2: Notebook Investigation Experience #93

@bordumb


Goal: A data scientist can run investigations directly in Jupyter with rich output.
User Value: Stay in the notebook environment for debugging—no context switching.
Competitor Weakness: Monte Carlo has no notebook story; GX is validation-only; Soda is testing-only.


Task 0.2.1: Enhanced %dataing ask with Rich Output

Title: Upgrade notebook ask command with streaming rich widgets

Description:
The existing %dataing ask magic should be enhanced with proper Jupyter widgets for streaming output. Hypotheses, queries, and evidence should render as interactive collapsible sections. The final synthesis should render as formatted HTML.

Why: Notebooks are the natural home for data scientists. Rich output makes investigations feel native to the Jupyter experience.

Acceptance Criteria:

  • %dataing ask "<question>" streams investigation to output
  • Each hypothesis renders as collapsible accordion widget
  • SQL queries render with syntax highlighting (pygments)
  • Query results show as pandas DataFrames (truncated)
  • Evidence sections show support/refute badges
  • Final synthesis renders as styled HTML box
  • Confidence displayed as progress bar
  • Timeline visualization shows investigation flow
  • Output is reproducible—re-running cell shows cached result
  • %%dataing ask cell magic for multi-line questions

Key Design Notes:

  • Use ipywidgets for interactive elements
  • Fallback to plain text for non-widget environments (VS Code, etc.)
  • Cache investigation results in notebook metadata for reproducibility
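The feature-detection-and-fallback approach from the design notes can be sketched as follows. This is illustrative only; helper names like `render_hypothesis` are assumptions, not the actual %dataing internals:

```python
# Sketch: detect whether ipywidgets can render in this frontend, and fall
# back to plain text otherwise. Names here are hypothetical.

def widgets_available() -> bool:
    """Return True if ipywidgets is importable inside an IPython session."""
    try:
        import ipywidgets  # noqa: F401
        from IPython import get_ipython
        return get_ipython() is not None
    except ImportError:
        return False

def render_hypothesis(title: str, body: str) -> None:
    """Render a hypothesis as a collapsible accordion, or plain text."""
    if widgets_available():
        import ipywidgets as widgets
        from IPython.display import display
        accordion = widgets.Accordion(children=[widgets.HTML(body)])
        accordion.set_title(0, title)
        display(accordion)
    else:
        # Plain-text fallback for VS Code, nbconvert, terminal IPython, etc.
        print(f"== {title} ==\n{body}")
```

The same `widgets_available()` check would gate the progress bar, badges, and timeline widgets listed in the acceptance criteria.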

Key APIs:

  • Existing investigation APIs
  • SSE stream handling
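The SSE stream handling could follow a minimal parser like this sketch, which assumes the stream arrives as raw text lines; the real client would wrap the existing investigation endpoint:

```python
# Minimal Server-Sent Events parser: groups `event:`/`data:` lines into
# dicts, one per blank-line-terminated event. Illustrative, not the
# production client.
from typing import Iterable, Iterator

def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {'event': ..., 'data': ...} dicts from raw SSE lines."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":  # blank line terminates an event
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

Each yielded event would then be routed to the matching widget (hypothesis accordion, query panel, synthesis box).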

Dependencies:

  • Existing notebook magic infrastructure

Risks + Mitigations:

  • Risk: Widget rendering varies across Jupyter environments → Mitigation: Feature detection, graceful fallback
  • Risk: Large DataFrames crash output → Mitigation: Always truncate, show row count
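The truncation mitigation might look like this sketch (the helper name and the use of `DataFrame.attrs` to carry the omitted-row count are assumptions):

```python
import pandas as pd

def truncate_preview(df: pd.DataFrame, max_rows: int = 20) -> pd.DataFrame:
    """Return a bounded head of df, recording how many rows were omitted."""
    preview = df.head(max_rows).copy()
    # Stash the omitted count so the renderer can show "... and N more rows".
    preview.attrs["omitted_rows"] = max(len(df) - max_rows, 0)
    return preview
```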

Effort: M (4-5 days)

Designation: OSS


Task 0.2.2: Notebook Lineage Visualization

Title: Interactive lineage graph in notebook cells

Description:
%dataing lineage should render an interactive lineage graph showing upstream and downstream dependencies of the current context. Clicking nodes should show dataset details. The graph should support pan/zoom.

Why: Understanding data flow is essential for debugging. An interactive graph in the notebook keeps engineers in their environment.

Acceptance Criteria:

  • %dataing lineage renders graph for current context
  • %dataing lineage <dataset> renders graph for specific dataset
  • --depth <n> controls traversal depth (default 2)
  • --direction upstream|downstream|both controls direction
  • Nodes show dataset name, type (table/view), and data source
  • Edges show job names and last run time
  • Clicking node shows panel with: schema, metrics, recent investigations
  • Jobs (transformations) shown as diamond nodes
  • Pan/zoom/reset controls
  • Export to PNG/SVG
  • Fallback to ASCII art when widgets unavailable

Key Design Notes:

  • Use pyvis or ipycytoscape for graph rendering
  • Layout algorithm: dagre for hierarchy
  • Color coding: tables=blue, views=green, current=highlighted
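The ASCII fallback mentioned in the risks below could be a simple indented tree over the lineage API's adjacency data. This sketch assumes the graph has been reduced to an upstream adjacency map; names are illustrative:

```python
# Sketch: render upstream lineage as indented ASCII, honoring --depth.
# `upstream` maps each dataset to its direct upstream dependencies.

def render_ascii(root: str, upstream: dict[str, list[str]], depth: int = 2) -> str:
    """Render upstream lineage of `root` as an indented tree."""
    lines = [root]

    def walk(node: str, level: int) -> None:
        if level > depth:  # stop at the requested traversal depth
            return
        for parent in upstream.get(node, []):
            lines.append("  " * level + "<- " + parent)
            walk(parent, level + 1)

    walk(root, 1)
    return "\n".join(lines)
```

A downstream or bidirectional traversal would mirror this with a downstream adjacency map, per the --direction flag.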

Key APIs:

  • GET /api/v1/lineage/graph (exists)
  • GET /api/v1/lineage/job/{id} (exists)

Dependencies:

  • Task 0.1.4 (context management)

Risks + Mitigations:

  • Risk: Large lineage graphs unreadable → Mitigation: Collapse distant nodes, expand on click
  • Risk: Different notebook environments → Mitigation: Multiple backends (pyvis, graphviz, ASCII)

Effort: M (4-5 days)

Designation: OSS


Task 0.2.3: Investigation History and Replay

Title: Investigation history browser in notebook

Description:
%dataing history should show past investigations with the ability to load and replay them. This enables comparing investigations over time and building institutional knowledge.

Why: Investigations are valuable artifacts. Being able to browse and replay them turns debugging sessions into reusable knowledge.

Acceptance Criteria:

  • %dataing history shows list of recent investigations
  • Each entry shows: dataset, goal, status, duration, root cause summary
  • %dataing history --dataset <id> filters to specific dataset
  • %dataing history --days <n> filters by recency
  • Clicking entry loads investigation details
  • %dataing replay <investigation_id> loads investigation into context
  • Replayed investigation shows all evidence and synthesis
  • Compare mode: %dataing compare <id1> <id2> shows diff
  • Pagination for large history

Key Design Notes:

  • Use existing GET /api/v1/investigations endpoint
  • History widget as selectable list
  • Compare mode highlights different hypotheses and findings
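The compare-mode highlighting could start from a set diff over the two investigations' hypotheses. This sketch assumes each investigation payload carries a `hypotheses` list of strings; the real API response shape may differ:

```python
def diff_hypotheses(inv_a: dict, inv_b: dict) -> dict:
    """Classify hypotheses as shared or unique to each investigation."""
    a = set(inv_a.get("hypotheses", []))
    b = set(inv_b.get("hypotheses", []))
    return {
        "shared": sorted(a & b),   # explored in both investigations
        "only_a": sorted(a - b),   # highlight as unique to <id1>
        "only_b": sorted(b - a),   # highlight as unique to <id2>
    }
```

The same shape of diff would apply to findings and evidence sections.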

Key APIs:

  • GET /api/v1/investigations (exists)
  • GET /api/v1/investigations/{id} (exists)

Dependencies:

  • Task 0.2.1 (rich output infrastructure)

Risks + Mitigations:

  • Risk: Large history slows load → Mitigation: Pagination, lazy loading

Effort: S (3 days)

Designation: OSS
