Question

How to apply GATK on mtDNA sequencing data?

0

Entering edit mode

7.0 years ago

MatthewP ★ 1.4k

Hello, everyone. I sat up a mtDNA pipeline using samtools mpileup + varscan2 to do variant call. However this pipeline has one problem when there is more than one variants in same site, I can only get only one frequency value, I need one frequency data per variant because mtDNA heterogeneity is very important. Here is my result example:

MT      16183   rs28671493      A       AC,C ... 2/2:255:562:556:24:530:95.32%:2.3041E-290:36:37:19:5:190:340

I get only 1 frequency data 95.32% for two variants AC and C.
So I want to use GATK pipeline to do variant call, but I don't know how and their official websites don't have much information. Here is 2 questions I want to ask first:

Do I need to do BQSR ? My mtDNA sequencing is amplicon-based and has very high depth(about5000). Also which database should I use for --known-sites param if BQSR.
Which mode should I use Germline SNP + indels or Somatic SNVs + indels . I consider Somatic because mtDNA heterogeneity, this would act like tumor.

Base (Quality Score) Recalibration
Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)
This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.

BaseRecalibrator [arguments]

--known-sites:FeatureInput    One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis.  This argument must be specified at least once. Required.

Thanks everybody!

GATK mtDNA • 2.1k views

ADD COMMENT • link 7.0 years ago by MatthewP ★ 1.4k

score 2 · Answer 1 · 2018-08-30

Hello MatthewP ,

Do I need to do BQSR ? My mtDNA sequencing is amplicon-based and has very high depth(about5000).

no, you don't need BQSR. BQSR is only useful if your overall basequality is very low and/or you have a low diversity genome. For all other situation the benefit by BQSR is minimal and just a waste of time.

Also which database should I use for --known-sites param if BQSR.

For human this is usually dbSNP.

fin swimmer