Background: The CFTR gene has a Poly TG/Poly T region in intron 8 that has varying clinical consequences. I'm using targeted NGS for this region (150bp, paried end) and I'm trying to call the PolyTG/PolyT genotypes from fastq files. My first attempt was bbmap run with default parameters and a reference including (TG)9-12(T)5,7,9:
bbmap.sh ref=$ref in1=$R1 in2=$R2 scafstats=$out
39/40 samples could be correctly called based on the bbmap stats (header shown below). Heterozygous calls were made from ref_name's with the highest unambiguous reads. Homozygotes could be figured based on the ratio of the top two.
#sample ref_name per_unambiguousReads unambiguousMB per_ambiguousReads ambiguousMB unambiguousReads ambiguousReads assignedReads assignedBases
I expanded the reference to catch rare genotypes (e.g. T6 and TG13). The expanded reference has (TG)8-13(T)2-10. Running bbmap with this reference results in many incorrect calls.
I'm wondering if there are parameters I should be tweaking in bbmap or if there is a better solution to calling the Poly TG/Poly T genotypes that I'm not thinking of.
Thanks, Bob
Charles, thank you for the thoughtful response. I completely agree with your statement on the need to throughly validate variant calling, especially in a targeted panel, especially when reporting clinically. We are validating the assay for clinical use (screening). The script I posted above is only for TG(m)T(n) calling. The snp/indel, structural variant and CNV calling is all done using Archer Analysis, which under the hood uses Freebayes and in-house algorithms specific to their VariantPlex assay (anchored multiplex PCR w/ molecular barcodes). We are (IMO) very throughly validating the assay with multiple samples containing variants of all types (snp/indel/SV/CNV) - 200+ samples. In addition to these samples, we plan to generate in-silico positives to test more of the left/right alignment issues with indels. The script above was tested on these 200+ samples and correctly identified the (TG)m(T)n status for all samples. I will also add, any reportable (TG)m(T)n allele will be confirmed using an independent extract and technology.
Thanks for the details of your carrier information. It is technical issues such as this that keep me up at night! Even for pipelines that are heavily tested, there are always fringe cases that you did not think of. It is splitting hairs, but I would argue that the Mayo report is not incorrect in stating you are not a carrier [for the variants they look at]. The GET-evidence report im puzzled by; it correctly right-aligned it "CFTR-L88Shift", so I'm not sure why the correct annotations were not pulled.
The Mayo Report actually uses Exome Sequencing (not just the SNPs that they describe), so I was kind of surprised they don't consider that for their report (and suggested they report more variants with varying levels of confidence, if that is what they were worried about - even though the variant already had 3 stars in ClinVar).
However, unless you can open the file in Excel, I think visualizing the gVCF is more difficult to check / re-analyze than a .bam file (which I thought showed very clear evidence for my variant). So, that is one of the reasons I currently wouldn't recommend other people use that for testing.
Maybe there is some other explanation, but I thought the indel formatting was a factor (and I'm sure that must contribute to some formatting issues, although knowing the variant is in a homopolymer run from freebayes could also be good information to know).
Thank you for the detailed response!
I also see that NP_000483.3:p.Leu88fs is an alias in dbSNP, so perhaps that does match the "Insufficient Evidence" tab for GET-Evidence (although I probably should have mentioned the ClinVar report, even thought I don't see any CFTR variants there, for whatever reason).
Thank you for causing me to look into that! I updated the GitHub page with the IGV screenshot (it is a note for within the Genos Exome section, mentioning the Veritas WGS results on the PGP; however, I eventually hope to have a blog post that may make this more clear).
Charles, I looked into GET-evidence further. I see their about page does at least states it is not clinical quality:
They do allow anyone with a login to edit their annotations, so this variant and others that I spot checked against CFTR2 "known disease causing" could be corrected.
The Genos VCF looks malformed. I'm not sure if that contributed to the miss or not (maybe a silent fail on the annotation tool?):
However, the variant is in the VCF:
This variant is in the Helix gVCF (note Helix pipeline used hg38). I can see Mayo limiting their analysis to particular variants that they are very confident in a stable classification. I agree variants with "expert review" should have correct classifications, but I also don't have a good feel for how often a "pathogenic" 3 star variant is later found VOUS or benign:
It is in the Genos VCF:
Awesome - I am glad that this interested you enough to look into it more!
I didn't previously know about the CFTR2 resource - I think that is something really good to have. I used it to look up more information about 394delTT, and that also said that I should be considered a carrier.
I am also testing out updating the GET-Evidence report (to point out the dbSNP and CFTR2 entries, and I also added 4 PubMed entrires), and I noticed that it is getting a prioritization of 4. However, it didn't automatically connect to dbSNP, and (as far as I could tell) I couldn't link that myself (but I sent an e-mail that hopefully gets that sorted out).
Thanks again for following up!
Also, I'm not sure if people who otherwise get sequencing from Veritas for WGS have a different report (I got data from them, but through the PGP, with PGP reports). However, I did have an on-line physican sign-off for Veritas, Genos, GeneGuide, and Color (although I never talked to any of them, so I assume everybody gets approved). GeneGuide and Color also include support from a genetic counselor in their pricing.
However, Mayo GeneGuide specifically tests disease status for 4 genes (one of which is CFTR). They do Exome sequencing that is capable of detecting my variant (even though I could tell they weren't checking that in the formal report). ClinVar, CFTR2, and the genetic counselors all agreed that I should be considered a carrier, so I suggested GeneGuide support should look into changing what they include in their report (since I was reported as not being a carrier in their test).
Within the genetic testing registry, I would find "Whole Genome Sequencing" as a "clinical test" for Veritas, but I am not entirely clear what is counted under the "hereditary disease" portion of their test. Color is also in the GTR (and, if I remember correctly, I think they did provide the most detailed approval e-mail), but cystic fibrosis is not among the 26 conditions they official test for (although I hope they can update their options to provide raw data while I wait for the additional testing to complete, especially since I picked them to test because of the data.color.com interface). "Mayo Clinic Genetic Testing Laboratories" is in the GTR, but not the Mayo GeneGuide (or Helix). I could not find Genos as a "lab" in the GTR.
23andMe is also not in the GTR, even though they have FDA approval for some reports, and they were the only one that correctly identified me as being a cystic fibrosis carrier (even though I could have identified myself as a carrier with re-analysis of the raw data from Veritas, Genos, or GeneGuide/Helix). I know they have CLIA certification. I thought the cystic fibrosis result was FDA-approved because I currently see it, and they had to remove a lot of results for a period of time; however, I don't think the cystic fibrosis status is actually approved by the FDA because I don't see 23andMe listed for cystic fibrosis on this page from the FDA.
Update (4/20): I previously didn't think my variant was on the V5 chip (I had a V3 chip), but this was a formatting issue. I was looking for rs121908769, but I needed to use 23andMe's identifer.
So, this is what my cystic fibrosis variant looks like in 23andMe's raw data that you can download:
(carriers will be "DI," people that get cystic fibrosis from having two copies of this same variant will be "DD", even though this is not the most common cystic fibrosis variant)
I apologize for the confusion: I no longer have the specific concern about there being informative probes removed from the later array (at least for this variant).
Also, I figured this out when a relative informed me about their result. So, maybe the cystic fibrosis status is approved by the FDA to be returned to customers now? I admittedly still don't seem them on the page linked above, but I'm not otherwise sure about the explanation.