Virus mutation definition sequencing
1
0
Entering edit mode
9.1 years ago
Anna S ▴ 520

Hello,

What is the customary mutation threshold for considering a site to be mutated in a virus?

I'm working on a project where apparently viruses have different mutations depending on the ethnic background of the host. When I mapped the virus sequences to the virus genome, I noticed that there were indeed several mutations at >= 20% (that is, >= 20% of the reads have a different nucleotide at a particular genome position than the reference nucleotide). I was wondering what is the customary threshold used? This is a new field for me and I'm trying to read the literature to figure this out, but I know how awesome you biostars are, and so here I am!

Thank you!

Anna

virus sequencing mutation • 3.1k views
ADD COMMENT
2
Entering edit mode
9.1 years ago
pld 5.1k

You'll want to actually get into the variant calling. Although GATK isn't designed for haploid organisms (ssRNA viruses for me), I've had success using that pipeline to call variants.

ADD COMMENT
0
Entering edit mode

Hi, what do you use for known sites, an empty file? The BaseRecalibrator step "does a by-locus traversal operating only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality". Is this an appropriate assumption for viruses? Thanks.

ADD REPLY
0
Entering edit mode

BaseRecalibrator requires a list of validated variants, such as dbSNP. There are somewhat arbitrary guidelines for creating your own list (see here), but it's probably not relevant for your experiment.

ADD REPLY
0
Entering edit mode

Thanks. IndelRealigner and BaseRecalibrator are not relevant. So I went directly from remove duplicates to HaplotypeCaller. The command ran successfully with no errors but resulted in 0 variants! Perhaps because a mutation rate of 20% at a genome position is not considered 'significant' for humans, which is what GATK was designed for?

ADD REPLY
0
Entering edit mode

I'll try next creating a vcf file with the mutations I identified manually and using this as a "known" vcf db file and see if GATK calls these.

ADD REPLY
0
Entering edit mode

FYI. Not called even when in 'known' vcf file.I think I need to go the old fashioned way and read the literature!!

ADD REPLY
0
Entering edit mode

You have to remember that calling variants isn't as simple as counting the number of bases at a given position that don't match the reference. You have to consider the quality of the read, mapping quality, location of the aligned position within the read and etc.

You wouldn't want to call SNPs off of low quality bases located in the 3' end of the read, they're likely to be

ADD REPLY
0
Entering edit mode

Give me a moment, I'll pull up a script I have. It won't translate directly, and include the whole process from raw fq data, but you can see the steps I took. Check my OP.

EDIT: Sorry, it is very long, I'll give you the gist here.

Assuming you've assembled and etc:

  1. Generate raw VCF using samtools mpileup
  2. Use picard to mark dups
  3. GATK, follow their pipeline: RealignerTargetCreator -> IndelRealigner -> BaseRecalibrator -> AnalyzeCovariates (optional, but a good QC step) -> PrintReads -> UnifiedGenotyper -> VariantsToTable

The step VariantsToTable is what give you the final VCF, with properly called variants.

ADD REPLY
0
Entering edit mode

Thank you, Joe!

ADD REPLY

Login before adding your answer.

Traffic: 1873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6