Hello everyone! Please excuse me if this question is a bit naïve: I'm new to bioinformatics in general and GATK in particular.
I am using the GATK4 suite to ultimately call germline variants on whole exome sequencing data obtained from an Illumina NextSeq 550 sequencer. (For a variety of reasons I cannot use the WDL/Cromwell setup recommended by the Best Practices, so I am trying to replicate the recommended workflow as a series of Bash scripts.)
I would like to speed up the BQSR step by employing the Scatter / Gather strategy. However, studying this article (https://gatk.broadinstitute.org/hc/en-us/articles/360035890531-Base-Quality-Score-Recalibration-BQSR-), I've realized that BaseRecalibrator requires a lot of data to build a proper statistical model.
My question: is it okay to scatter the BaseRecalibrator job by chromosome if I analyze just one WES sample at a time? (I know that downstream I will need to perform joint genotyping with 30+ samples, but at the moment I'm preparing single-sample BAM files one-by-one.)
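For reference, here is a minimal sketch of the per-chromosome layout I have in mind. It is an illustration rather than a tested pipeline. All file names (reference.fasta, sample.markdup.bam, dbsnp.vcf.gz, the recal.*.table outputs) are hypothetical placeholders, and the chr1..chr22 plus chrX contig naming is an assumption about the reference build. By default the script only prints the gatk commands. The per-chromosome tables are merged with GatherBQSRReports before ApplyBQSR is run once on the whole BAM.

```shell
#!/bin/sh
# Sketch: scatter BaseRecalibrator by chromosome, gather the reports, apply once.
# File names and contig names are hypothetical placeholders (assumptions).
set -eu

REF="reference.fasta"
BAM="sample.markdup.bam"
KNOWN_SITES="dbsnp.vcf.gz"

# Dry run by default: commands are printed, not executed.
# Run with RUN="" in the environment to actually call gatk.
RUN="${RUN-echo}"

# 23 shards: chr1..chr22 plus chrX (assumed naming, matching the "/ 23" estimate).
CHROMS=""
i=1
while [ "$i" -le 22 ]; do
  CHROMS="$CHROMS chr$i"
  i=$((i + 1))
done
CHROMS="$CHROMS chrX"

# Scatter: one BaseRecalibrator run per chromosome.
# Each iteration is independent, so it could be sent to a separate core or job.
scatter_bqsr() {
  for chrom in $CHROMS; do
    $RUN gatk BaseRecalibrator \
      -R "$REF" -I "$BAM" \
      --known-sites "$KNOWN_SITES" \
      -L "$chrom" \
      -O "recal.$chrom.table"
  done
}

# Gather: merge the per-chromosome tables, then recalibrate the whole BAM.
gather_bqsr() {
  inputs=""
  for chrom in $CHROMS; do
    inputs="$inputs -I recal.$chrom.table"
  done
  # $inputs is left unquoted on purpose so it splits into separate arguments.
  $RUN gatk GatherBQSRReports $inputs -O recal.merged.table
  $RUN gatk ApplyBQSR -R "$REF" -I "$BAM" \
    --bqsr-recal-file recal.merged.table -O sample.recal.bam
}

scatter_bqsr
gather_bqsr
```

Running the script as-is prints 23 BaseRecalibrator commands followed by one GatherBQSRReports and one ApplyBQSR command.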
The article above says specifically that BaseRecalibrator expects each read group to have at least 100M bases. Calculated naively, PF_HQ_ALIGNED_BASES / 23 = 215+ megabases (the metric is taken from the CollectAlignmentSummaryMetrics output).
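The naive estimate above can be written out as a small check. This is only a sketch: the total below is a placeholder back-calculated as 215M x 23, not my actual metric, and the function and variable names are made up for illustration. It also assumes bases are spread evenly across the 23 shards, which exome data will not satisfy exactly.

```shell
#!/bin/sh
# Naive per-shard estimate: total high-quality aligned bases split evenly.
set -eu

# 100M figure quoted from the BQSR article.
MIN_BASES_PER_READ_GROUP=100000000

# bases_per_shard TOTAL_BASES N_SHARDS -> prints the integer bases per shard.
bases_per_shard() {
  echo $(( $1 / $2 ))
}

# Placeholder total (215M x 23); substitute the real PF_HQ_ALIGNED_BASES
# value from the CollectAlignmentSummaryMetrics output.
TOTAL_BASES=4945000000
N_SHARDS=23

per_shard=$(bases_per_shard "$TOTAL_BASES" "$N_SHARDS")
if [ "$per_shard" -ge "$MIN_BASES_PER_READ_GROUP" ]; then
  echo "per shard: $per_shard bases (at or above the 100M guideline)"
else
  echo "per shard: $per_shard bases (below the 100M guideline)"
fi
```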
Thank you!
— Alex.
P.S. This is a repost of my question from the GATK forum. I apologize if this is generally frowned upon, but since this is not a technical issue with the tool itself, the team could not offer any guidance as of yet.
Could you post your BQSR command, please? How many samples do you have? In my experience, BQSR should not take too much time.