-
Notifications
You must be signed in to change notification settings - Fork 100
release 3.2.0 -- prep for nanoRNAseq #322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
bambu version
…h existing nf-core modules
Update 3_1 dev meta
With the updated biocontainer for m6anet being built, can update the singularity and docker container with the latest version. This version changed how m6anet is called. It has `m6anet` and the analysis as a parameter (`inference`, `dataprep`) instead of `m6anet-run_inference` and `m6anet-dataprep` commands. There are now two files for output from `m6anet inference`, but the existing output of '*' captures those. In the future, might be nice to specify these individually and emit them as separate outputs for downstream analysis.
I updated the samplesheet to output the filename that was attempted to be used when looking for the fastq file. When running the pipeline with multiple samples, I was unable to determine which sample was causing the samplesheet error until I added this code.
Updated m6anet to 2.0.2
…mplesheet Samplesheet error message updated with more useful information if fastq can't be found.
…latest ones did not work
Added ToulligQC in raw read QC step
update to Template 3.2.1 and all nf-core/modules; add dorado + ToulligQC; remove DNA functionalities
|
awgymer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the efforts being made here but I don't think this PR is ready for a release just yet. I think some of the debug/test profiles may not work either but I haven't confirmed this locally.
I've left a few specific comments but I think there might be benefit in some more restructuring/trimming too.
I also think there would be big benefit in the validation of the samplesheet and parameters being more removed from the pipeline logic and handled by nf-schema.
| pull_request_target: | ||
| branches: [master] | ||
| branches: | ||
| - main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master/main - are both intended to be supported by nf-core repos now?
| K562,1,,K562_directcDNA_replicate1.fastq.gz,genome.fa,genes.gtf | ||
| K562,2,,K562_directcDNA_replicate2.fastq.gz,genome.fa,genes.gtf | ||
| K562,3,,K562_directRNA_replicate3.fastq.gz,genome.fa,genes.gtf | ||
| sample,fastq_1,fastq_2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unchanged from the template. I would either remove or update to reflect a real samplesheet format for nanoseq
| @@ -1,5 +1,5 @@ | |||
| { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unchanged from template. Can either remove or consider updating to reflect the allowed samplesheet formats for nanoseq. Even if this file is unused right now (e.g. no nf-schema) an accurate schema can really help users.
| // This is an example of how to use getGenomeAttribute() to fetch parameters | ||
| // from igenomes.config using `--genome` | ||
| params.fasta = getGenomeAttribute('fasta') | ||
| params.gtf = getGenomeAttribute('gtf') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These params need to be initialised above the include for NANOSEQ or these defaults won't be propogated into that workflow.
You have to include { getGenomeAttribute } then put these lines, and then include the workflows.
| @@ -21,6 +21,24 @@ if (params.input) { | |||
| exit 1, 'Input samplesheet not specified!' | |||
| } | |||
|
|
|||
| if (params.fasta){ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed if getGenomeAttribute gets put in the proper place in the main.nf
|
|
||
| script: | ||
| """ | ||
| dorado aligner --mm2-preset map-ont $fasta $mod_bam > aligned.bam && samtools sort aligned.bam -o aligned_sorted.bam && samtools index aligned_sorted.bam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider piping the outputs rather than chaining with &&
| BAM_STATS_SAMTOOLS ( ch_bam_bai, ch_fasta ) | ||
| ch_stats = BAM_STATS_SAMTOOLS.out.stats | ||
| ch_flagstat = BAM_STATS_SAMTOOLS.out.flagstat | ||
| ch_idxstats = BAM_STATS_SAMTOOLS.out.idxstats | ||
|
|
||
| emit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider naming the outputs for better clarity
| bam_index_extension = 'bai' | ||
| cigar_paf_format = false | ||
| cigar_bam = false | ||
| if (params.call_variants) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there actually a difference in these two calls to MINIMAP2 here? I can't see any custom config for these aliases which would indicate they produce different outputs.
Additionally access of params here is against the newest guidelines for nextflow and will be deprecated shortly.
| .set { ch_sample_gtf } | ||
| STRINGTIE_STRINGTIE ( ch_sorted_bam, ch_annotation_gtf ) | ||
| ch_stringtie_gtf = STRINGTIE_STRINGTIE.out.transcript_gtf | ||
| stringtie2_version = STRINGTIE_STRINGTIE.out.versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Versions can be mixed into a single ch_versions for each subworkflow and emitted as a single versions channel and then mixed into the main ch_versions of the parent workflow, unless there is good reason to keep them separate. (This applies to all the local subworkflows)
| ch_software_versions = ch_software_versions.mix(CUSTOM_GETCHROMSIZES.out.versions.first().ifEmpty(null)) | ||
|
|
||
| // will add the following in when nf-core/modules/minimap2/align supports junction bed input | ||
| //GTF2BED ( ch_chr_sizes ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTF2BED not used? Consider removing the module from the include and the repo generally
fbdtemme
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left some comments, mostly related to more closely following nf-core/modules style in the local modules.
| // Input data to perform both basecalling and demultiplexing | ||
| input = 'https://raw.githubusercontent.com/yuukiiwa/test-datasets/nanoseq/3.2/samplesheet/samplesheet_bc_dx.csv' | ||
| fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/nanoseq/reference/hg19_KCMF1.fa' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using the params.pipelines_testdata_base_path here instead to make the test inputs more generic.
There are quite some other places in the different .config files where the same can be done.
| // Limit resources so that this can run on GitHub Actions | ||
| max_cpus = 2 | ||
| max_memory = '6.GB' | ||
| max_time = '6.h' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are now replaced with the process.resourceLimits from above so they can be removed.
| path fasta | ||
|
|
||
| output: | ||
| tuple val(meta), path("aligned_sorted.bam"), path("*.bai") , emit: aligned_bam |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very small thing, but I find it a little confusing to call this bam bai combined channel aligned_bam.
I would just expect [ meta, bam ] from that name.
Maybe split it up
tuple val(meta), path("aligned_sorted.bam") , emit: bam
tuple val(meta), path("*.bai") , emit: bai
or rename to aligned_bam_bai ?
| path "versions.yml" , emit: versions | ||
|
|
||
| script: | ||
| def emit_args = (params.dorado_modification == null) ? " --emit-fastq > basecall.fastq && gzip basecall.fastq" : " --modified-bases $params.dorado_modification > basecall.bam" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here the comment from @awgymer above:
Pipe directly to gzip instead of chaining with &&
| sample_summary = "$meta.id" +"_summary.txt" | ||
| sample_eventalign = "$meta.id" +"_eventalign.txt" | ||
| fastq="$meta.id"+".fastq" | ||
| fast5 = "$meta.fast5" | ||
| fastqi="$fastq"+"*" | ||
| """ | ||
| gunzip -c $fastqgz > $fastq | ||
| f5c index -d $fast5 $fastq | ||
| echo $fastqi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is slightly cleaner to use interpolation here instead of the string concatenation with +
| sample_summary = "$meta.id" +"_summary.txt" | |
| sample_eventalign = "$meta.id" +"_eventalign.txt" | |
| fastq="$meta.id"+".fastq" | |
| fast5 = "$meta.fast5" | |
| fastqi="$fastq"+"*" | |
| """ | |
| gunzip -c $fastqgz > $fastq | |
| f5c index -d $fast5 $fastq | |
| echo $fastqi | |
| def sample_summary = "${meta.id}_summary.txt" | |
| def sample_eventalign = "${meta.id}_eventalign.txt" | |
| def fastq ="${meta.id}.fastq" | |
| def fast5 = meta.fast5 | |
| """ | |
| gunzip -c $fastqgz > $fastq | |
| f5c index -d $fast5 $fastq | |
| echo ${fastq}* |
| path "test-datasets/fast5/$barcoded/*" , emit: ch_input_fast5s_path | ||
| path "test-datasets/fast5/$barcoded/" , emit: ch_input_fast5_dir_path | ||
| path "test-datasets/modification_fast5_fastq/", emit: ch_input_dir_path | ||
| path "test-datasets/pod5/" , emit: ch_pod5_dir_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the ch_ here from the output channel names.
| cat <<-END_VERSIONS > versions.yml | ||
| "${task.process}": | ||
| m6anet: \$( echo 'm6anet 1.0' ) | ||
| m6anet: \$( echo 'm6anet 2.0.2' ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not call m6anet --version here?
It looks like its supported in m6anaet
| script: | ||
| bedmethyl = "$meta.id" +".bed" | ||
| """ | ||
| modkit pileup $aligned_mod_bam $bedmethyl --threads $task.cpus | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to use the nf-core/modules standard prefix here:
| script: | |
| bedmethyl = "$meta.id" +".bed" | |
| """ | |
| modkit pileup $aligned_mod_bam $bedmethyl --threads $task.cpus | |
| script: | |
| def prefix = task.ext.prefix ?: meta.id | |
| """ | |
| modkit pileup $aligned_mod_bam "${prefix}.bed" --threads $task.cpus |
| script: | ||
| sample_summary = "$meta.id" +"_summary.txt" | ||
| sample_eventalign = "$meta.id" +"_eventalign.txt" | ||
| fast5 = "$meta.fast5" | ||
| """ | ||
| nanopolish index -d $fast5 $fastq | ||
| nanopolish eventalign --reads $fastq --bam $bam --genome $genome --scale-events --signal-index --summary $sample_summary --threads $task.cpus > $sample_eventalign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Cleaner to use the standard prefix variable:
| script: | |
| sample_summary = "$meta.id" +"_summary.txt" | |
| sample_eventalign = "$meta.id" +"_eventalign.txt" | |
| fast5 = "$meta.fast5" | |
| """ | |
| nanopolish index -d $fast5 $fastq | |
| nanopolish eventalign --reads $fastq --bam $bam --genome $genome --scale-events --signal-index --summary $sample_summary --threads $task.cpus > $sample_eventalign | |
| script: | |
| def prefix = task.ext.prefix ?: meta.id | |
| fast5 = "$meta.fast5" | |
| """ | |
| nanopolish index -d $fast5 $fastq | |
| nanopolish eventalign --reads $fastq \\ | |
| --bam $bam \\ | |
| --genome $genome \\ | |
| --scale-events --signal-index \\ | |
| --summary ${prefix}_summary.txt \\ | |
| --threads $task.cpus > ${prefix}_eventalign |
|
|
||
| script: // This script is bundled with the pipeline, in nf-core/nanoseq/bin/ | ||
| updated_path = workflow.profile.contains('test_nodx_rnamod') ? "$input_path" : "not_changed" | ||
| updated_path = (workflow.profile.contains('test_bc_nodx') || workflow.profile.contains('rnamod')) ? "$input_path" : "not_changed" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its cleaner here to pass in workflow.profile as a val input to the process instead of accessing the global workflow object from inside a process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line should be changed
from:
ch_fastqc_multiqc.ifEmpty([]),
to:
ch_fastqc_multiqc.collect().ifEmpty([]),
No description provided.