I have a sample with multiple fastq files, one for each lane it was sequenced on. I was planning to merge these fastqs and then use the bowtie2 -p option to take advantage of all of the available cores on my machine. However, I have read that it can be faster to align the fastqs separately in parallel and then merge the resulting .sam files. Given that I am already using the -p option to parallelize alignment of my merged fastq, is this actually faster? For example, if I have 8 fastq files and 16 cores, which of the following is faster, and by how much:
- align all 8 fastqs in parallel using 2 cores each, then merge the .sams
- merge fastqs then use all 16 cores to align
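For concreteness, the two options could be scripted roughly like this (the index name, file names, and lane count are placeholders, not your actual files):

```shell
# Option 1 (hypothetical filenames): align each lane with 2 threads,
# coordinate-sort each output, then merge the sorted BAMs.
for i in $(seq 1 8); do
    ( bowtie2 -p 2 -x genome_index -U lane${i}.fastq.gz \
        | samtools sort -O BAM -o lane${i}.sorted.bam - ) &
done
wait
samtools merge merged.bam lane?.sorted.bam

# Option 2: concatenate the FASTQs (valid for gzipped files, since
# concatenated gzip streams are themselves a valid gzip stream),
# then align with all 16 threads.
cat lane?.fastq.gz > merged.fastq.gz
bowtie2 -p 16 -x genome_index -U merged.fastq.gz -S merged.sam
```

Note that Option 1 folds the sort into each per-lane job, so the final merge is cheap; that makes the timing comparison with Option 2 (which still needs a sort afterwards) slightly apples-to-oranges.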
Thanks,
kaston
Devon, your response time is staggering! We'll try that, thanks!
Re: piping - we are currently writing the .sam files and then sorting and marking duplicates with Picard tools. The .sam files are ~200 GB in size and we have 70 GB of RAM available. I assume you suggest piping to avoid writing to disk, but is that feasible given our file sizes and RAM constraints?
If piping is the way to go, we aren't sure what syntax to use. For sorting, our command is:
How would we change this to pipe the output of bowtie2?
I expect your answer in under 30 seconds.
Sorry to disappoint on the quickness of my reply, I blame the 6-9 hour time difference.
Piping obviously won't completely avoid writing to disk, but it's typically faster to convert to BAM and write that to disk than to write the raw SAM file.
Regarding piping with Picard, see this post: Piping Input Into Picard Sortsam. In short, just use `/dev/stdin` as the input. You could also use samtools (e.g., `bowtie2 ...stuff... | samtools view -Su - | samtools sort -T prefix -O BAM - > foo.sorted.bam`), though it doesn't index during sorting like Picard does, which is too bad.
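Putting that together, the Picard variant of the pipe might look roughly like the following (the index name, input file, jar path, and heap size are placeholders, not values from the thread):

```shell
# Hypothetical paths/filenames: stream bowtie2's SAM output directly into
# Picard SortSam via /dev/stdin, so the ~200 GB uncompressed SAM is never
# written to disk; only the compressed, sorted BAM is.
bowtie2 -p 16 -x genome_index -U merged.fastq.gz \
    | java -Xmx8g -jar picard.jar SortSam \
        INPUT=/dev/stdin \
        OUTPUT=sorted.bam \
        SORT_ORDER=coordinate
```

The Java heap (`-Xmx`) bounds Picard's in-memory sort buffer; with limited RAM it will spill sorted chunks to temporary files and merge them, so the sort still works well under your 70 GB ceiling.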