Hello,
What is the customary mutation threshold for considering a site to be mutated in a virus?
I'm working on a project where apparently viruses have different mutations depending on the ethnic background of the host. When I mapped the virus sequences to the virus genome, I noticed that there were indeed several mutations at >= 20% (that is, >= 20% of the reads have a different nucleotide at a particular genome position than the reference nucleotide). I was wondering what is the customary threshold used? This is a new field for me and I'm trying to read the literature to figure this out, but I know how awesome you biostars are, and so here I am!
Thank you!
Anna
Hi, what do you use for known sites, an empty file? The BaseRecalibrator step "does a by-locus traversal operating only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality". Is this an appropriate assumption for viruses? Thanks.
BaseRecalibrator requires a list of validated variants, such as dbSNP. There are somewhat arbitrary guidelines for creating your own list (see here), but it's probably not relevant for your experiment.
Thanks. IndelRealigner and BaseRecalibrator are not relevant. So I went directly from remove duplicates to HaplotypeCaller. The command ran successfully with no errors but resulted in 0 variants! Perhaps because a mutation rate of 20% at a genome position is not considered 'significant' for humans, which is what GATK was designed for?
I'll try next creating a vcf file with the mutations I identified manually and using this as a "known" vcf db file and see if GATK calls these.
FYI. Not called even when in 'known' vcf file.I think I need to go the old fashioned way and read the literature!!
You have to remember that calling variants isn't as simple as counting the number of bases at a given position that don't match the reference. You have to consider the quality of the read, mapping quality, location of the aligned position within the read and etc.
You wouldn't want to call SNPs off of low quality bases located in the 3' end of the read, they're likely to be
Give me a moment, I'll pull up a script I have. It won't translate directly, and include the whole process from raw fq data, but you can see the steps I took. Check my OP.
EDIT: Sorry, it is very long, I'll give you the gist here.
Assuming you've assembled and etc:
The step VariantsToTable is what give you the final VCF, with properly called variants.
Thank you, Joe!