@edmundmiller edmundmiller commented Oct 27, 2025

Summary

Add automatic upload of workflow outputs to Seqera Platform datasets, so that index files produced by a workflow's output block are published as Platform datasets without a manual upload step.

Features

1. Automatic Dataset Creation & Upload

  • Auto-create datasets for workflow outputs with index files
  • Upload CSV/TSV index files to Platform datasets
  • Configurable naming pattern with workflow metadata variables
  • Per-output configuration for granular control

2. Manual HTTP Implementation

  • Uses manual HTTP requests with multipart/form-data encoding
  • Leverages existing TowerClient infrastructure
  • No external SDK dependencies
  • Builds in CI without GitHub Packages credentials

3. Configuration Options

tower {
    datasets {
        enabled = true
        createMode = 'auto'  // or 'existing'
        namePattern = '${workflow.runName}-outputs'
        
        perOutput {
            'my_output' {
                enabled = true
                datasetId = 'existing-dataset-id'  // optional
            }
        }
    }
}
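As an illustration of how a namePattern such as the one above could be expanded, here is a minimal sketch. The class name DatasetNamePattern, the resolve() method, and the flat "workflow.runName" key are hypothetical and are not taken from the plugin; the actual substitution logic in DatasetConfig may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not the plugin's code): expands ${key} placeholders
// in a dataset name pattern using a flat map of workflow metadata.
final class DatasetNamePattern {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String pattern, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown placeholders are left untouched rather than guessed.
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("workflow.runName", "happy_darwin");
        System.out.println(resolve("${workflow.runName}-outputs", vars));
    }
}
```

The pattern is written in single quotes in the configuration so that Groovy does not interpolate it at parse time; expansion is deferred until the run name is known.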

Implementation Details

Dataset Upload Flow:

  1. Listen for WorkflowOutputEvent events via TraceObserverV2
  2. Collect outputs with index files (CSV/TSV)
  3. On workflow completion, create dataset(s) as configured
  4. Upload index files using multipart HTTP POST
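The four steps above can be sketched roughly as follows. This is not the TowerClient code: OutputIndexCollector, Uploader, and the extension-based index check are assumptions made for illustration, and the real observer receives WorkflowOutputEvent objects rather than a name and a path.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical collector mirroring the described flow; all names are illustrative.
final class OutputIndexCollector {
    // Output name -> index file, kept in arrival order.
    private final Map<String, Path> pending = new LinkedHashMap<>();

    // Assumption: an index file is recognised by its .csv or .tsv extension.
    static boolean isIndexFile(Path p) {
        if (p == null || p.getFileName() == null) return false;
        String name = p.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".csv") || name.endsWith(".tsv");
    }

    // Steps 1-2: called once per workflow output; outputs without an index are ignored.
    void onWorkflowOutput(String outputName, Path indexFile) {
        if (isIndexFile(indexFile)) pending.put(outputName, indexFile);
    }

    // Stand-in for "create the dataset, then upload the index file".
    interface Uploader {
        void createAndUpload(String outputName, Path indexFile);
    }

    // Steps 3-4: called once on workflow completion; returns the number of uploads attempted.
    int onFlowComplete(Uploader uploader) {
        int count = pending.size();
        pending.forEach(uploader::createAndUpload);
        pending.clear();
        return count;
    }

    public static void main(String[] args) {
        OutputIndexCollector collector = new OutputIndexCollector();
        collector.onWorkflowOutput("results", Path.of("out/index.csv"));
        collector.onWorkflowOutput("logs", Path.of("out/run.log"));
        collector.onFlowComplete((name, file) -> System.out.println(name + " -> " + file));
    }
}
```

Deferring the upload to completion means a failed run never leaves a half-populated dataset behind, at the cost of holding the list of index files in memory for the duration of the run.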

HTTP Implementation:

  • createDataset() - POST JSON payload to datasets API
  • uploadFile() - Multipart/form-data file upload per RFC 7578 (which obsoletes RFC 2388)
  • createMultipartBody() - Manual multipart encoding
  • Uses existing HttpClient from TowerClient infrastructure
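For readers unfamiliar with manual multipart encoding, the following sketch shows what a single-file multipart/form-data body looks like on the wire. It borrows the createMultipartBody name from the list above, but the signature and body are assumptions written for illustration, not the plugin's implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative single-part multipart/form-data encoder (signature is an assumption).
final class MultipartSketch {
    static byte[] createMultipartBody(String boundary, String fieldName, String fileName,
                                      String contentType, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Part header: boundary line, disposition, content type, then an empty line.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: " + contentType + "\r\n\r\n";
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        // Raw file bytes, copied unmodified.
        out.writeBytes(content);
        // Closing delimiter: the boundary followed by two dashes.
        out.writeBytes(("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] csv = "sample,fastq\nA,a.fq\n".getBytes(StandardCharsets.UTF_8);
        byte[] body = createMultipartBody("demo-boundary", "file", "index.csv", "text/csv", csv);
        System.out.print(new String(body, StandardCharsets.UTF_8));
    }
}
```

The request must also carry a Content-Type header of the form multipart/form-data; boundary=demo-boundary, and the boundary string must not occur inside the file content; a random boundary per request is the usual way to make a collision unlikely.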

Configuration Classes:

  • DatasetConfig - Main configuration with validation
  • Support for auto/existing modes and per-output settings
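A minimal sketch of the kind of validation such a configuration class might perform is given below. The class name DatasetConfigSketch and the default values for createMode and namePattern are assumptions; only the option names, the two modes, and the opt-in behaviour come from this description.

```java
import java.util.Map;

// Hypothetical stand-in for DatasetConfig; the defaults below are assumptions.
final class DatasetConfigSketch {
    final boolean enabled;
    final String createMode;
    final String namePattern;

    DatasetConfigSketch(Map<String, Object> opts) {
        // The feature is opt-in, so anything other than an explicit true disables it.
        this.enabled = Boolean.TRUE.equals(opts.get("enabled"));
        // Assumed defaults: 'auto' mode and the pattern used in the example configuration.
        this.createMode = String.valueOf(opts.getOrDefault("createMode", "auto"));
        this.namePattern = String.valueOf(
                opts.getOrDefault("namePattern", "${workflow.runName}-outputs"));
        if (!createMode.equals("auto") && !createMode.equals("existing")) {
            throw new IllegalArgumentException(
                    "createMode must be 'auto' or 'existing', got: " + createMode);
        }
    }

    public static void main(String[] args) {
        DatasetConfigSketch cfg = new DatasetConfigSketch(
                Map.of("enabled", true, "createMode", "existing"));
        System.out.println(cfg.enabled + " " + cfg.createMode + " " + cfg.namePattern);
    }
}
```

Rejecting an unknown createMode at construction time surfaces a typo when the configuration is loaded rather than at workflow completion, when the upload would otherwise fail.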

Testing

Unit Tests (3 tests)

  • Workflow output event collection
  • Index file detection
  • Configuration validation

Integration Test

  • Real API upload (conditional on TOWER_ACCESS_TOKEN)
  • End-to-end validation with actual Platform

Validation Workflow

  • Manual test workflow in validation/dataset-upload/
  • Demonstrates complete dataset upload flow
  • Includes comprehensive testing guide

Documentation

  • Configuration reference in DatasetConfig
  • Validation workflow README with troubleshooting
  • Prerequisites: Nextflow 25.10.0+ for output {} block support

Architecture Decision

Why Manual HTTP vs tower-java-sdk?

The implementation was initially refactored to use tower-java-sdk, then reverted because:

  • The SDK is hosted on GitHub Packages, which requires authentication even for public packages
  • CI builds cannot access GitHub Packages without credentials
  • The manual HTTP implementation has no external dependencies, which keeps the build simple
  • It reuses the existing TowerClient HTTP infrastructure

The manual implementation maintains identical functionality while ensuring CI compatibility.

Breaking Changes

None - feature is opt-in via configuration.

Related Issues

Checklist

  • Implemented manual HTTP multipart upload
  • Added comprehensive unit tests
  • Added integration test
  • Created validation workflow
  • Fixed workflow output DSL syntax
  • Updated documentation
  • All tests passing
  • Works in CI without authentication

netlify bot commented Oct 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

Latest commit: 687b0c3
Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/69009f52888d130008c6fe31

Implement automatic upload of Nextflow workflow output index files to
Seqera Platform datasets when workflows complete, enabling seamless
integration between Nextflow's output syntax and Platform's dataset
management.

Changes:
- Add DatasetConfig class for dataset upload configuration
  - Support auto-create or use existing datasets
  - Customizable dataset name patterns with variable substitution
  - Per-output configuration overrides
- Update TowerConfig to include datasets configuration scope
- Implement dataset upload in TowerClient:
  - Collect workflow outputs via onWorkflowOutput() callback
  - Upload index files on workflow completion (onFlowComplete)
  - Create datasets via Platform API with proper workspace URLs
  - Use multipart/form-data for file uploads (matches tower-cli)
  - Add URL builders for dataset API endpoints
- Add comprehensive unit tests for DatasetConfig

API Implementation:
- Create dataset: POST /workspaces/{id}/datasets/
- Upload file: POST /workspaces/{id}/datasets/{id}/upload
- Proper multipart/form-data format with file field
- Workspace ID in URL path (not query param)
- Header detection via ?header=true query parameter
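The endpoint layout listed above could be produced by URL builders along these lines. The class and method names are hypothetical (the commit mentions getUrlDatasets() and getUrlDatasetUpload() but does not show them), https://api.example.com is a placeholder endpoint, and omitting the query parameter when header detection is off is an assumption.

```java
// Hypothetical URL builders following the endpoint layout in the commit message.
final class DatasetUrls {
    // Drop a single trailing slash so the joined path has no double slash.
    private static String base(String endpoint) {
        return endpoint.endsWith("/")
                ? endpoint.substring(0, endpoint.length() - 1)
                : endpoint;
    }

    // POST target for dataset creation; the workspace ID is part of the path.
    static String datasets(String endpoint, String workspaceId) {
        return base(endpoint) + "/workspaces/" + workspaceId + "/datasets/";
    }

    // POST target for the file upload; header=true marks the first row as a header.
    static String datasetUpload(String endpoint, String workspaceId,
                                String datasetId, boolean header) {
        String url = base(endpoint) + "/workspaces/" + workspaceId
                + "/datasets/" + datasetId + "/upload";
        return header ? url + "?header=true" : url;
    }

    public static void main(String[] args) {
        System.out.println(datasets("https://api.example.com", "12345"));
        System.out.println(datasetUpload("https://api.example.com", "12345", "abc", true));
    }
}
```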

Configuration example:
  tower {
    datasets {
      enabled = true
      createMode = 'auto'
      namePattern = '${workflow.runName}-outputs'
      perOutput {
        'results' { datasetId = 'existing-id' }
      }
    }
  }

Based on research of tower-cli (v0.15.0) and Seqera Platform API
documentation to ensure correct endpoint structure and payload format.

Signed-off-by: Edmund Miller <edmund.a.miller@gmail.com>
Signed-off-by: Edmund Miller <edmund.miller@seqera.io>
…et upload

Refactor dataset upload implementation to use the official tower-java-sdk
instead of manual HTTP multipart encoding, significantly simplifying the
code and improving maintainability.

Changes:
- Add tower-java-sdk dependency (1.43.1) with GitHub Packages repository
- Replace manual HTTP implementation with DatasetsApi SDK methods:
  - createDataset() now uses datasetsApi.createDataset(wspId, request)
  - uploadIndexToDataset() now uses datasetsApi.uploadDataset(wspId, id, header, file)
- Remove ~120 lines of manual HTTP code:
  - Deleted getUrlDatasets() and getUrlDatasetUpload() URL builders
  - Deleted uploadFile() multipart HTTP request construction
  - Deleted createMultipartBody() RFC 2388 multipart encoding
- Add comprehensive test coverage:
  - 7 unit tests with mocked DatasetsApi (initialization, event collection, 
    dataset creation, file upload, exception handling)
  - 1 integration test with real Platform API (conditional on TOWER_ACCESS_TOKEN)
  - Manual test workflow in test-dataset-upload/ directory with documentation

Testing:
- All unit tests passing (BUILD SUCCESSFUL)
- Integration test ready (runs when TOWER_ACCESS_TOKEN available)
- Test workflow provides end-to-end validation guide

Benefits:
- Uses official Seqera SDK (same as tower-cli)
- Easier to test with mocked API
- SDK handles all HTTP/multipart details automatically
- Bug fixes in SDK benefit us automatically
- Code reduced from ~300 lines to ~100 lines

Note: Requires GitHub credentials for tower-java-sdk dependency.
Configure github_username and github_access_token in gradle.properties
or set GITHUB_USERNAME and GITHUB_TOKEN environment variables.

Signed-off-by: Edmund Miller <edmund.a.miller@gmail.com>
Signed-off-by: Edmund Miller <edmund.miller@seqera.io>

The tower-java-sdk dependency from GitHub Packages requires authentication
even for public packages, causing CI build failures. This reverts the SDK
refactoring and restores the manual HTTP implementation.

Changes:
- Removed tower-java-sdk dependency from build.gradle
- Restored manual HTTP methods in TowerClient.groovy:
  - getUrlDatasets() and getUrlDatasetUpload() URL helpers
  - createDataset() with JSON payload and sendHttpMessage()
  - uploadFile() multipart HTTP implementation
  - createMultipartBody() RFC 2388 implementation (~120 lines total)
- Simplified TowerClientTest.groovy to remove SDK-specific tests
- Kept core functionality tests and integration test
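To make the restored createDataset() path more concrete, here is a rough sketch of a JSON POST built with the JDK's java.net.http API. It is a stand-in, not the plugin's code: the commit routes the call through sendHttpMessage(), and the "name" payload field, the Bearer authorization header, and the placeholder URL are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative request builder; payload shape and auth scheme are assumptions.
final class CreateDatasetRequestSketch {
    // Escapes only backslashes and double quotes; control characters are not handled.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static HttpRequest build(String url, String accessToken, String datasetName) {
        String json = "{\"name\":\"" + jsonEscape(datasetName) + "\"}";
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + accessToken)
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = build(
                "https://api.example.com/workspaces/12345/datasets/",
                "placeholder-token", "happy_darwin-outputs");
        // Only builds the request; nothing is sent over the network.
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Building on the JDK client keeps the dependency list unchanged, which is the property the revert is meant to preserve.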

Functionality remains identical - only the implementation approach changed
from SDK calls to direct HTTP requests. This allows the plugin to build
successfully in CI without requiring GitHub Package authentication.

Signed-off-by: Edmund Miller <edmund.miller@seqera.io>