The current GeneLab Affymetrix Microarray consensus processing pipeline (NF_MAAffymetrix), GL-DPPD-7114, is implemented as a Nextflow DSL2 workflow and utilizes Singularity to run all tools in containers. This workflow (NF_MAAffymetrix) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, the Nextflow documentation is a useful resource for users who want to modify and/or extend this workflow.
The NF_MAAffymetrix workflow is composed of three subworkflows as shown in the image above. Below is a description of each subworkflow and the additional output files generated that are not already indicated in the GL-DPPD-7114 pipeline document:
- Analysis Staging Subworkflow
  - Description:
    - This subworkflow extracts the metadata parameters (e.g. organism, Array Design REF) needed for processing from the OSD/GLDS ISA archive and retrieves the raw data files hosted on the Open Science Data Repository (OSDR).
      OSD/GLDS ISA archive: ISA directory containing Investigation, Study, and Assay (ISA) metadata files for a respective GLDS dataset. The *ISA.zip file is located in the OSDR under 'Files' -> 'Study Metadata Files' for any GeneLab Data Set (GLDS) in the OSDR.
- Affymetrix Microarray Processing Subworkflow
  - Description:
    - This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using the GL-DPPD-7114 pipeline.
- V&V Pipeline Subworkflow
  - Description:
    - This subworkflow performs validation and verification (V&V) on the raw and processed data files. It performs a series of checks on the output files generated and flags the results, using the flag codes indicated in the table below, which are outputted into a log file.

      V&V Flags:

      | Flag Code | Flag Name | Interpretation |
      |---|---|---|
      | 2 | MANUAL | Special flag that indicates a manual check that is advised. Often used to advise what should be visually assessed in QA plots. |
      | 20 | GREEN | Indicates the check passed all validation conditions |
      | 30 | YELLOW | Indicates the check was flagged for minor issues (e.g. slight outliers) |
      | 50 | RED | Indicates the check was flagged for moderate issues (e.g. major outliers) |
      | 80 | HALT | Indicates the check was flagged for severe issues that trigger a processing halt (e.g. missing data) |
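When reading a V&V log, it can be handy to translate the numeric flag codes back into their names. The sketch below only encodes the V&V flag table from this document; `vv_flag_name` is a hypothetical helper for illustration and is not part of the NF_MAAffymetrix workflow.

```shell
# Map a V&V flag code to its flag name, following the V&V flag table.
# vv_flag_name is a hypothetical helper, not something shipped with the workflow.
vv_flag_name() {
  case "$1" in
    2)  echo "MANUAL" ;;
    20) echo "GREEN" ;;
    30) echo "YELLOW" ;;
    50) echo "RED" ;;
    80) echo "HALT" ;;
    *)  echo "UNKNOWN"; return 1 ;;   # code not listed in the table
  esac
}

# Print every code from the table next to its name (tab-separated).
for code in 2 20 30 50 80; do
  printf '%s\t%s\n' "$code" "$(vv_flag_name "$code")"
done
```

Codes that are not in the table print `UNKNOWN` and return a non-zero status, so the helper can also be used in a conditional.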
- 1. Install Nextflow and Singularity
- 2. Download the Workflow Files
- 3. Run the Workflow
- 4. Additional Output Files
Nextflow can be installed either through Anaconda or as documented on the Nextflow documentation page.
Note: If you want to install Anaconda, we recommend installing the Miniconda (Python 3) version appropriate for your system, as instructed by Happy Belly Bioinformatics.
Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands:
conda install -c bioconda nextflow
nextflow self-update
Singularity is a container platform that allows usage of containerized software. This enables the GeneLab NF_MAAffymetrix workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.
We recommend installing Singularity system-wide, as described in the associated documentation.
Note: Singularity is also available through Anaconda.
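Before moving on, it may help to confirm that both tools are reachable from your shell. The following is a minimal sketch; `check_tool` is a hypothetical helper defined here for illustration only, and it relies on nothing beyond the shell's built-in `command -v`.

```shell
# Report whether a command is available on the PATH.
# check_tool is a hypothetical helper used only for this illustration.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# "|| true" keeps the loop going when one of the tools is missing.
for tool in nextflow singularity; do
  check_tool "$tool" || true
done
```

Once both tools are reported as found, `nextflow -version` and `singularity --version` print the installed versions.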
All files required for utilizing the NF_MAAffymetrix GeneLab workflow for processing Affymetrix Microarray data are in the workflow_code directory. To get a copy of the latest NF_MAAffymetrix version onto your system, download the code as a zip file from the release page and unzip it by running the following commands:
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MAAffymetrix_1.0.4/NF_MAAffymetrix_1.0.4.zip
unzip NF_MAAffymetrix_1.0.4.zip

While in the location containing the NF_MAAffymetrix_1.0.4 directory that was downloaded in step 2, you are now able to run the workflow. Below are three examples of how to run the NF_MAAffymetrix workflow:
Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
Approach 1: Start with OSD and GLDS accessions as input

nextflow run NF_MAAffymetrix_1.0.4/main.nf \
   -profile singularity \
   --osdAccession OSD-266 \
   --gldsAccession GLDS-266

Approach 2: Start with a runsheet as input

Note: Specifications for creating a runsheet manually are described here.

nextflow run NF_MAAffymetrix_1.0.4/main.nf \
   -profile singularity \
   --runsheetPath </path/to/runsheet>

Approach 3: Start with an ISA archive as input

Note: Specifications for the ISA Tab Archive format can be found here.

nextflow run NF_MAAffymetrix_1.0.4/main.nf \
   -profile singularity \
   --isaArchivePath </path/to/isaArchive>

Required Parameters For All Approaches:
- NF_MAAffymetrix_1.0.4/main.nf - Instructs Nextflow to run the NF_MAAffymetrix workflow
- -profile - Specifies the configuration profile(s) to load; singularity instructs Nextflow to set up and use Singularity for all software called in the workflow
Additional Required Parameters For Approach 1:
- --osdAccession OSD-### - specifies the OSD ID to process through the NF_MAAffymetrix workflow (replace ### with the OSD number)
- --gldsAccession GLDS-### - specifies the GLDS ID to process through the NF_MAAffymetrix workflow (replace ### with the GLDS number)
Additional Required Parameters For Approach 2:
--runsheetPath- specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the GeneLab Repository for the GLDS dataset being processed)
Optional Parameters:
- --skipVV - skip the automated V&V processes (Default: the automated V&V processes are active)
- --resultsDir - specifies the output directory for all files produced by the workflow (Default: <OSD-NNN_GLDS-NNN> if OSD and GLDS accessions are specified; otherwise, the workflow launch directory)
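The optional parameters are appended to the same command line as the required ones. As a hedged illustration, the sketch below prints (rather than runs) a runsheet-based command that also sets `--skipVV` and `--resultsDir`. `nf_affy_cmd` is a hypothetical helper, and the runsheet path and results directory shown are placeholders, not values from the workflow.

```shell
# Print (do not run) a Nextflow command that combines the runsheet approach
# with the optional parameters. Both arguments are caller-supplied
# placeholders: $1 = path to a runsheet, $2 = results directory.
nf_affy_cmd() {
  runsheet="$1"
  results_dir="$2"
  # Each printf emits one line; all but the last end with a line continuation.
  printf '%s \\\n' "nextflow run NF_MAAffymetrix_1.0.4/main.nf"
  printf '   %s \\\n' "-profile singularity" "--runsheetPath $runsheet" "--skipVV"
  printf '   %s\n' "--resultsDir $results_dir"
}

nf_affy_cmd /path/to/runsheet.csv my_results
```

Printing the command first makes it easy to check the hyphen counts (single for Nextflow arguments, double for workflow parameters) before launching a run.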
All parameters listed above and additional optional arguments for the NF_MAAffymetrix workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
nextflow run NF_MAAffymetrix_1.0.4/main.nf --help

See nextflow run -h and Nextflow's CLI run command documentation for more options and details common to all Nextflow workflows.
All R code steps and output are rendered within a Quarto document yielding the following:
- Output:
- NF_MAAffymetrix_1.0.4.html (html report containing executed code and output including QA plots)
The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
Note: The outputs from the Affymetrix Microarray Processing Subworkflow are documented in the GL-DPPD-7114.md processing protocol.
Analysis Staging Subworkflow
- Output:
- *_microarray_v1_runsheet.csv (table containing metadata required for processing, including the raw data file locations)
- *-ISA.zip (the ISA archive of the GLDS dataset to be processed, downloaded from the GeneLab Data Repository)
V&V Pipeline Subworkflow
- Output:
- VV_log_VV_AGILE1CH.tsv.MANUAL_CHECKS_PENDING (table containing V&V flags for all checks performed. Also contains rows indicating suggested manual checks focusing on QA plots embedded in the html report)
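To focus on the checks that need attention, the V&V log can be filtered by flag code. The sketch below makes assumptions that this document does not confirm: that the log is tab-separated with a header row, and that one column holds the numeric flag code. The column name is passed in by the caller; `flag_code` in the usage comment is a hypothetical name, as is the helper `vv_filter` itself.

```shell
# Print the header plus every row whose flag code is at or above a threshold.
# Assumptions (not confirmed by the workflow documentation): the log is
# tab-separated, has a header row, and has a numeric flag-code column whose
# name is given as the third argument.
vv_filter() {
  log_file="$1"; min_code="$2"; code_col="$3"
  awk -F '\t' -v min="$min_code" -v col="$code_col" '
    NR == 1 {
      # Locate the flag-code column by its header name.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print; next
    }
    ($idx + 0) >= (min + 0)
  ' "$log_file"
}

# Example (hypothetical column name): show RED (50) and HALT (80) rows.
# vv_filter </path/to/VV_log_VV_AGILE1CH.tsv.MANUAL_CHECKS_PENDING> 50 flag_code
```

Inspect the header of your own log first and pass the actual column name; the function exits with a non-zero status if the named column is absent.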
Standard Nextflow resource usage logs are also produced as follows:
Further details about these logs can also be found in this Nextflow documentation page.
Nextflow Resource Usage Logs
- Output:
- Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
- Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
- Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)
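The trace file lends itself to quick command-line checks. The following sketch lists tasks whose status is not COMPLETED. It assumes a tab-separated trace with a header row that includes `name` and `status` columns, which matches Nextflow's default trace fields; the workflow's configuration may use a different field list, so treat this as an illustration. `trace_not_completed` is a hypothetical helper.

```shell
# List tasks in a Nextflow trace file whose status is not COMPLETED.
# Assumes a tab-separated file with a header row containing "name" and
# "status" columns (Nextflow defaults; a workflow config may change them).
trace_not_completed() {
  awk -F '\t' '
    NR == 1 {
      # Find the positions of the two columns from the header.
      for (i = 1; i <= NF; i++) {
        if ($i == "name") n = i
        if ($i == "status") s = i
      }
      if (!n || !s) { print "name/status columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    $s != "COMPLETED" { print $n "\t" $s }
  ' "$1"
}

# Example: replace {timestamp} with the value from your own run.
# trace_not_completed Resource_Usage/execution_trace_{timestamp}.txt
```

Note that tasks restored from the cache on a resumed run are reported with a status other than COMPLETED, so they also appear in this listing.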