I'm calling variants using the IonTorrent TorrentSuite on DNA sequenced from formalin-fixed paraffin-embedded (FFPE) tissue. A major issue is that, without treatment with uracil-N-glycosylase (UNG), some of the Cs in the original DNA are deaminated to uracil, which is read as T. After sequencing and variant calling, these can show up as mutations, either as C>T transitions or as G>A (from the opposite strand, due to PCR in the library prep). I have no idea how long these samples were stored, without UNG treatment, before sequencing.
TVC (Torrent Variant Caller) reports a deamination metric (essentially, the number of C>T and G>A variants divided by the number of all variants called), and for our samples the highest value seen is ~0.92. Naive postprocessing of the variants shows that, for these samples, C>T/T>C transitions (plot: https://ibb.co/KjPkkT3) overwhelm the remaining variants.
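To make that kind of naive postprocessing concrete, here is a minimal sketch that tallies the substitution classes in a VCF and computes a deamination-like fraction. It only approximates the description above and is not TVC's own calculation. The function names and the example records are made up for illustration.

```python
from collections import Counter

def substitution_spectrum(vcf_lines):
    """Count REF>ALT classes for SNVs in an iterable of VCF text lines.

    Indels and other non-SNV alleles are skipped; each ALT allele of a
    multi-allelic record is counted separately.
    """
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            is_snv = (len(ref) == 1 and len(alt) == 1
                      and ref in "ACGT" and alt in "ACGT" and ref != alt)
            if is_snv:
                counts[f"{ref}>{alt}"] += 1
    return counts

def deamination_fraction(counts):
    """Fraction of SNVs that are C>T or G>A (an approximation, not TVC's metric)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["C>T"] + counts["G>A"]) / total

# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tC\tT\t50\tPASS\tAF=0.04",
    "chr1\t200\t.\tG\tA\t40\tPASS\tAF=0.03",
    "chr1\t300\t.\tA\tC\t900\tPASS\tAF=0.48",
    "chr1\t400\t.\tAT\tA\t300\tPASS\tAF=0.50",
]
spectrum = substitution_spectrum(example)
print(dict(spectrum), deamination_fraction(spectrum))
```

For a real file, the same functions can be fed an open file handle, e.g. `substitution_spectrum(open("sample.vcf"))`, where `sample.vcf` is a placeholder name.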
My question is this: given the IonTorrent variant calling pipeline (sequencing > BAM file > TVC > VCF file with deamination statistic), is there:
a) a way of correcting the output VCF, or
b) a set of filters to use in bcftools,
to reduce this effect on the samples?
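As one illustration of what option (b) could look like, here is a rough sketch in plain Python (not bcftools) that sets aside C>T and G>A SNVs whose allele fraction falls below a cutoff. It assumes the allele fraction is stored in an INFO tag called `AF`, which should be checked against the header of the actual TVC output. The 0.10 cutoff is an arbitrary placeholder, not a recommended value, and the function names are invented for this example.

```python
# Substitution classes expected from cytosine deamination (both strands).
DEAMINATION_LIKE = {("C", "T"), ("G", "A")}

def parse_info(info):
    """Turn a VCF INFO string such as 'AF=0.04;DP=800' into a dict."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def split_by_deamination_risk(vcf_lines, min_af=0.10, af_tag="AF"):
    """Split data records into (kept, flagged).

    A record is flagged only if it is a biallelic C>T or G>A SNV whose
    allele fraction is below min_af. Multi-allelic records and records
    without a usable af_tag value are kept untouched.
    """
    kept, flagged = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # header lines are not returned by this sketch
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3].upper(), fields[4].upper()
        try:
            af = float(parse_info(fields[7])[af_tag])
        except (KeyError, ValueError):
            af = None  # missing tag, or comma-separated multi-allelic values
        if (ref, alt) in DEAMINATION_LIKE and af is not None and af < min_af:
            flagged.append(line)
        else:
            kept.append(line)
    return kept, flagged

# Hypothetical records, for illustration only.
records = [
    "chr1\t100\t.\tC\tT\t50\tPASS\tAF=0.04;DP=800",
    "chr1\t200\t.\tG\tA\t1500\tPASS\tAF=0.49;DP=900",
    "chr1\t300\t.\tA\tC\t60\tPASS\tAF=0.03;DP=700",
]
kept, flagged = split_by_deamination_risk(records)
print(len(kept), "kept;", len(flagged), "flagged")
```

Note the trade-off: a filter like this will also discard genuine low-fraction C>T and G>A variants, so it works against the goal of capturing rare variants.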
My use case is this: these are medical samples that have been inspected by a pathologist (hence the FFPE treatment), and I want to determine which variants are predictive* of outcome. I therefore have two potentially contradictory goals: reduce false positives, and capture the rarer variants which may hold predictive power.
How bad would it be to restrict the analysis to transversions?
That's one possible strategy, though it could throw out a lot of important information with regard to the clinical outcomes. The other strategy that I'm playing with is to cluster the called variants on quality and on allele fraction. C>T and G>A transitions due to the DNA damage are likely to be randomly located throughout the DNA, so the reads carrying them won't pile up at one position as much as reads carrying true mutations do, and the resulting calls should therefore have low quality and a low allele fraction. If this is true, then C>T and G>A transitions due to deamination of the original DNA may form a cluster separate from the true positive mutations, which would enable a filter. I just need to aggregate all my data and analyse the stats to see whether this second strategy could work.
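A toy sketch of that clustering idea, assuming each call has already been reduced to an (allele fraction, log10 QUAL) pair. It uses a hand-rolled two-cluster k-means purely for illustration. Any clustering method could be swapped in, and whether real data separate this cleanly is exactly what remains to be checked. The variant values below are invented.

```python
import math
from statistics import mean, pstdev

def standardize(points):
    """Z-score each feature so allele fraction and quality are comparable."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]

def two_means(points, n_iter=50):
    """Plain 2-means on standardized points; returns a list of 0/1 labels.

    Cluster 0 is seeded at the point with the lowest first feature (here the
    allele fraction) and cluster 1 at the highest, so the run is deterministic.
    """
    z = standardize(points)
    centres = [min(z, key=lambda p: p[0]), max(z, key=lambda p: p[0])]
    labels = None
    for _ in range(n_iter):
        new_labels = [min((0, 1), key=lambda k: math.dist(p, centres[k]))
                      for p in z]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(z, labels) if lab == k]
            if members:
                centres[k] = tuple(mean(c) for c in zip(*members))
    return labels

# Invented (class, allele fraction, QUAL) values, for illustration only.
variants = [
    ("C>T", 0.03, 35), ("G>A", 0.04, 30), ("C>T", 0.05, 42),
    ("A>C", 0.47, 900), ("C>T", 0.52, 1200), ("T>G", 0.95, 2500),
]
points = [(af, math.log10(qual)) for _, af, qual in variants]
labels = two_means(points)
for (cls, af, qual), lab in zip(variants, labels):
    print(cls, af, qual, "cluster", lab)
```

If the low-fraction, low-quality cluster turns out to be dominated by C>T and G>A calls while the other cluster shows a more balanced spectrum, that would support using cluster membership as a filter.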
Can I ask, could you plot the damage? Also, could you post a bit of the data somewhere and message me? I have never had my hands on IonTorrent aDNA before.