Hi,
I am following GATK's Best Practice Workflow for germline short variants discovery in single samples. The pipeline is composed of the following steps:
FastqToSam
MarkIlluminaAdapters
SamToFastq
bwa mem
MergeBamAlignment
MarkDuplicates
BaseRecalibrator
ApplyBQSR
ValidateSamFile
HaplotypeCaller
I work with paired-end sequencing data, and most samples have one forward-read FASTQ file and one reverse-read FASTQ file. However, for a couple of samples the sequencing data is split across multiple lanes. As far as I can see, most commands in the pipeline do not accept multi-lane input, only one forward and one reverse file (or one unmapped BAM and one aligned BAM for MergeBamAlignment). Should I merge all forward and all reverse FASTQ files before starting the pipeline (the quality of each dataset seems comparable according to FastQC/MultiQC), or only later (and, if so, at which step)?
Thanks for your input.
Hi lieven.sterck,
thanks for the reply. Of course, I will keep samples separate. I was just wondering what best practice dictates about when (at which step of the pipeline) to merge the results from the various lanes.
I would say step 1. Personally, I merge all lanes of a sample into a single file per read direction before doing anything else with them. Splitting a sample over several lanes is just a technical detail of the sequencing run, so you can safely cat the files together. Keep the files in sync, though: concatenate the forward and reverse files in the same lane order!
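To illustrate the "keep in sync" part, here is a minimal Python sketch of such a merge. It assumes Illumina-style file names such as sampleA_L001_R1.fastq.gz (a hypothetical naming scheme, adjust the pattern to your own files), and the function names are made up for this example. It refuses to merge when a lane is missing its mate file and writes both outputs in the same sorted lane order.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme (hypothetical): <sample>_L001_R1.fastq.gz, <sample>_L001_R2.fastq.gz, ...
LANE_PATTERN = re.compile(r"_(L\d+)_(R[12])")


def lanes_for_read(paths, read):
    """Return {lane: path} for one read direction ("R1" or "R2")."""
    lanes = {}
    for path in map(Path, paths):
        match = LANE_PATTERN.search(path.name)
        if match and match.group(2) == read:
            lanes[match.group(1)] = path
    return lanes


def merge_lanes(paths, out_r1, out_r2):
    """Concatenate per-lane FASTQ files, using the same lane order for R1 and R2."""
    r1 = lanes_for_read(paths, "R1")
    r2 = lanes_for_read(paths, "R2")
    if set(r1) != set(r2):
        raise ValueError(f"lanes without a mate file: {sorted(set(r1) ^ set(r2))}")
    order = sorted(r1)
    for lanes, out_path in ((r1, out_r1), (r2, out_r2)):
        with open(out_path, "wb") as out:
            for lane in order:
                # Byte-wise copy, equivalent to `cat`; concatenated gzip files
                # remain a valid (multi-member) gzip stream.
                with open(lanes[lane], "rb") as src:
                    shutil.copyfileobj(src, out)
    return order
```

Calling merge_lanes(list_of_lane_files, "sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz") returns the lane order it used, which is handy to log. Note that this only checks that the lanes pair up; it does not verify that read names match between the forward and reverse files.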