Hello, everyone. I sat up a mtDNA pipeline using samtools mpileup
+ varscan2
to do variant call. However this pipeline has one problem when there is more than one variants in same site, I can only get only one frequency value, I need one frequency data per variant because mtDNA heterogeneity is very important. Here is my result example:
MT 16183 rs28671493 A AC,C ... 2/2:255:562:556:24:530:95.32%:2.3041E-290:36:37:19:5:190:340
I get only 1 frequency data 95.32%
for two variants AC
and C
.
So I want to use GATK pipeline to do variant call, but I don't know how and their official websites don't have much information. Here is 2 questions I want to ask first:
- Do I need to do
BQSR
? My mtDNA sequencing is amplicon-based and has very high depth(about5000). Also which database should I use for--known-sites
param if BQSR. - Which mode should I use
Germline SNP + indels
orSomatic SNVs + indels
. I considerSomatic
because mtDNA heterogeneity, this would act like tumor.
Base (Quality Score) Recalibration
Tools involved: BaseRecalibrator, Apply Recalibration, AnalyzeCovariates (optional)
This third processing step is performed per-sample and consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it's important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model. The initial statistics collection can be parallelized by scattering across genomic coordinates, typically by chromosome or batches of chromosomes but this can be broken down further to boost throughput if needed. Then the per-region statistics must be gathered into a single genome-wide model of covariation; this cannot be parallelized but it is computationally trivial, and therefore not a bottleneck. Finally, the recalibration rules derived from the model are applied to the original dataset to produce a recalibrated dataset. This is parallelized in the same way as the initial statistics collection, over genomic regions, then followed by a final file merge operation to produce a single analysis-ready file per sample.
BaseRecalibrator [arguments]
--known-sites:FeatureInput One or more databases of known polymorphic sites used to exclude regions around known polymorphisms from analysis. This argument must be specified at least once. Required.
Thanks everybody!