This is a crosspost from stack overflow (Nextflow - I/O and index issues). Based on this thread (how to handle exact duplicates...) crossposting is allowed on Biostars, but I will delete if necessary.
I am new to nextflow and am trying to build a pipeline. I am having trouble with passing outputs into new processes, properly capturing outputs, and with BWA-MEM2 index designation.
Here is my script:
#!/usr/bin/env nextflow
// Declare syntax version
nextflow.enable.dsl = 2
// Set parameters
params.fastq = "$projectDir/fastq/*_{1,2}.f*"
params.refgenome = "$projectDir/RefGenome/hg38.fa"
workflow {
read_pairs_ch = Channel.fromFilePairs(params.fastq, checkIfExists: true)
trimAndQC(read_pairs_ch)
readMapping(params.refgenome, trimAndQC.out.trimmedreads)
}
process trimAndQC {
debug true
tag "$sample_id"
publishDir 'trimmed/', mode: 'copy', overwrite: false
input:
tuple val(sample_id), path(fastq)
output:
tuple val('sample_id'), path('*_val_1.fq.gz'), path('*_val_2.fq.gz'), emit: trimmedreads
shell:
"""
trim_galore --paired --fastqc ${fastq.get(0)} ${fastq.get(1)}
"""
}
// BWA read mapping
process readMapping {
debug true
tag "$sample_id"
publishDir 'bwa_aligned/', mode: 'copy', overwrite: false
input:
path(refgenome)
tuple val(sample_id), val(trimmedreads)
output:
path('${sample_id}_sorted.bam'), emit: sorted_bam
script:
def indexbase = refgenome.baseName
"""
bwa-mem2 mem -t 2 '${indexbase}' '${trimmedreads}' | samtools sort -@2 -o '${sample_id}_sorted.bam' -
"""
}
First, I am getting this warning:
WARN: Input tuple does not match input set cardinality declared by process `readMapping` -- offending value: [sample_id, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz, /Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_2_val_2.fq.gz]
Which is confusing as my output is a tuple and thus should match the cardinality of the output of trimAndQC
. As an aside, I have noticed most scripts passing the input tuple (in this case path(fastq)
) as a single argument, but to get the command to work I have to use the .get()
accession method to 'unpack' the tuple, so I am unsure what I am doing wrong there.
Second, I am getting this error in regards to the output of readMapping
process:
ERROR ~ Error executing process > 'readMapping (sample_id)'
Caused by:
Missing output file(s) `${sample_id}_sorted.bam` expected by process `readMapping (sample_id)`
Command executed:
bwa-mem2 mem -t 2 'hg38' '/Users/me/Desktop/fastq/work/27/448a89259a9435ff86d52009afa05d/SRR925811_1_val_1.fq.gz' | samtools sort -@2 -o sample_id_sorted.bam -
Command exit status:
0
Command output:
(empty)
I assume this may be due to the cardinality warning above and sample_id
is empty.
Finally, I cannot seem to designate the index properly to the BWA-MEM2 command. The error:
Command error:
Looking to launch executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42", simd = .sse42
Launching executable "/Users/me/opt/anaconda3/envs/nextflow/bin/bwa-mem2.sse42"
-----------------------------
Executing in SSE4.2 mode!!
-----------------------------
* SA compression enabled with xfactor: 8
* Ref file: hg38
* Entering FMI_search
ERROR! Unable to open the file: hg38.bwt.2bit.64
Work dir:
/Users/me/Desktop/fastq/work/23/f8b3b6eace6f11f8619ce435b16725
Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
-- Check '.nextflow.log' file for details
In my RefGenome directory I have hg38.fa indexed on an AWS EC2 and downloaded to locally to prototype the pipeline. The files look have these extensions:
hg38.fa.{0123,amb,ann,pac,bwt.2bit.64}
and appear to work fine if I just run the bwa-mem2 mem
command in my terminal.
I have tried to adapt multiple SO answers (e.g., how to chain Channel.fromFilePairs from one process to another process, nextflow: how to pass directory path with files created in process before), I have gone through the EMBL training (Nextflow), looked through pipelines on nf-core and github, and have spent time with nextflow documentation prior to asking this question. I seem to be misunderstanding some inherent principle of nextflow so any advice on what to focus would be appreciated.
Finally, for clarity and reproducibility I am using the following fastqs from NCBI SRA (SRX317818) for prototyping. These are the versions of nextflow, trim-galore, and bwa-mem2 I am using all via an Anconda environment:
bwa-mem2 2.2.1 hdf58011_5 bioconda
nextflow 23.10.1 hdfd78af_0 bioconda
trim-galore 0.6.10 hdfd78af_0 bioconda
dthorbur Thanks for your response. Firstly, thanks for letting me know about the SO link - I have updated and it should work now.
I have updated the input of the readMapping process as you suggested:
It works great, thanks for catching that simple mistake!
As for the indexed reference genome, I believe it was indexed properly. As mentioned, I indexed on an EC2 instance as my laptop was hanging whenever I would try indexing locally. The files look like this:
They seem to be the right files and of a reasonable size.
Perhaps I will try on my EC2 and see if there was something lost during the compression and transfer of the files to my local machine. All that being said, when looking at other questions on Biostars or SO, I notice most have an indexing process within their pipeline. This makes passing the index to the alignment step trivial as they can just access the emitted variable from the index process. As you stated, indexing only needs to occur once, therefore I would like to skip this step in my pipeline by designating a 'pre' indexed reference. Therefore, what is considered best practice when passing an index? Adding a parameter designating the base?
or something like I attempted in my current script?
Also, while I realize it is effectively the same process, I am using bwa-mem2 over bwa.
So I have updated my script a bit. Due to issues with BWA-MEM2 accessing its associated index, I opted to reindex my reference using standard bwa. Here is the command:
and it proceeds fine ending in the following message:
I then ran my updated script:
and am met with a new error:
which is thoroughly confusing as the 'SRR925811_2_val_2.fq.gz' file is passed from the previous step via an emitted variable.
Further, if I
cd
into the relevant work directory and run the command:bwa starts up just fine. I am thoroughly stumped as to this new error.
Looking a little closer at the error, I notice the command executed is:
where the mate pair is
null
. I am unsure why this is happening as the variable is passed from the preceding process in the workflow. It appears it is an issue with the input tuple I am passing, but I cannot see my mistake.A few things.
Firstly, you can use
bwa-mem2 index reference.fasta
to index, though I don't know if there is a difference to normalbwa index
.Secondly, instead of
get
, you can usedef
to get around tuples with multiple objects. But I don't know what is considered best practices.Thirdly, I think one of the problems are that you are only giving a path for the index files, not a channel, and if you did make a channel you are creating a queue channel with
params.index = "/Users/me/Desktop/RefGenome/BWAIndex/*"
.I'm used to errors from DSL1, but queue channels and value channels are used very differently, and what you have done will iterate over each file in the index directory rather than loading them all. Also, you only need the index in the environment, only the reference genome fasta is called in the
bwa-mem2
command. Also wildcard in a whole directory feels a little dangerous to me, but maybe that's just me.Here is an update for your process
And an updated workflow
And here is how I used indexes for other processes to make value channels in the past.
Notice the use of
/first()
, this will ensure the channel is a value channel though I think it's less importand in DSL2.dthorbur Hi, thanks for the response.
BWA and BWA-MEM2 do create different index files, so unfortunately their respective indexes are not compatible (from what I can tell).
Thanks for the tip, I will definitely utilize this variable assignment in future scripts and to get around any tuple data structure issues.
Thanks for pointing that out - I was (and still am) unclear as to queue vs value channels and will review the documentation appropriately. In regard to this, should I be designating the channel prior to passing to the workflow? Such as:
I guess I am trying to understand what is different between setting the channel using the method above vs using the path() method within an input block within the process.
I was originally passing the reference genome (
params.refgenome
) but would still be met with an index error, hence my use of creating a directory and globbing all the files within. This has seemed to work, but I agree with the potential side effects of wildcarding an entire directory.I will give some of your updated processes a go and see if they resolve my issue. All this being said, I seem to have found a relative fix to my original input and my index issues. However my most recent problem is output from the
trimAndQC
process where the 'read2' mate pair file, deriving from the emittedtrimmedreads
variable is being interpreted as a directory (see above comment and the following snippet of the error).Which is confusing as the 'read1' is interpreted properly and the output tuple appears to be defined correctly. Any insight you may have on this would be appreciated. Thanks again for your help and patience!
So an update. I basically made a script from the processes and workflow you provided and am still unfortunately getting an error.
First when passing the reference and the index to the readMapping process I get variable name collison:
After changing my param file patterns a bit (i.e.,
hg38*
to just '*' to try and increase stringency) I was met with this unbound variable error:Which is confusing as I am passing the input tuple properly. To simplify troubleshooting, I removed the following command from the process:
So when I remove the reference and just past the index I get this error:
Where nextflow is being too literal with the reference passed to the command.
And when I remove the index and just pass the reference (which is in the same directory as the index files) I get this error:
For full context here is the script I most recently ran:
Only other thing I can think of trying is your method of making indexes and value channels at the end of your last comment. I can report back once I have tried to adapt that to my machine.
The last error makes sense, but why is a
.ann
file being loaded. If you are using the script verbatim that you put at the bottom, that couldn't have happened.And the index not being present is correct, it's in the original path, but it has not been loaded into the working directory. Nextflow creates stages and creates symlinks of files in channels, so if it doesn't know it needs to load a file, it won't and by default it will use files in the working directory. Not so much a problem when working on a local machine or some HPCs, but when you use cloud computing absolute paths become a problem.
Otherwise, I don't see why you should be getting errors with the script you loaded at the bottom.
Also, I'm a little concerned why the
sample_id
values are seemingly not working.dthorbur I am using the script I posted verbatim so I am just as confused as you are.
To simplify things, I am testing the problematic process (i.e.,
readMapping
) into an individual script to try to sort out the index issue. Also after looking at some other SO and Biostar posts and combined with your above index advice I am running the following script:And while it appears to happily ingest the index and it does not throw an error, it does not 'run'. This is what my terminal looks like:
and it just does nothing. So it is starting up, but doesn't seem to execute anything. Further, the work directory made by nextflow is empty so there is no log I can try to investigate. I am completely lost here and not sure what is wrong. I have also tried removing the
samtools
part of the command and the same thing happens. Any advice is appreciated.After the last set of errors and out of pure frustration, I decided to create a new Anaconda environment with freshly downloaded software. Here is the yaml I used:
I then adapted your method of index designation. Here is the simplified script I ended up using:
This worked and made the sorted bam file within the
work
directory. I am rerunning with the correction and then will test with the trim process and report update the post as necessary.