I'm currently trying to use the GATK pipeline for SNP calling to call SNPs. At the step where you use GATK's 'BaseRecalibrator' tool, you have to provide a set of known SNPs. Unfortunately, for my organisms, these are not annotated.
However, these organisms do have a reference genome sequenced. In a nature paper from Prado et al, they created a...
"set of training SNP positions that are independent of Illumina short read data based on capillary sequence traces obtained for each species from the NCBI trace archive based on the whole genome shotgun (WGS) reads produced for the species reference assemblies. We mapped capillary reads using ssaha2 and called SNPs using the neighborhood quality score criteria implemented in ssahaSNP." (Prado-Martinez et al Nature 2013, supplementary information)
So looking at the NCBI trace archive, it seems like there is a pretty nice ftp site (and even a java applet that you can use to download things). However, I'm a bit confused on how this will help me call SNPs. On the ssahaSNP page, they say that:
"ssahaSNP is a polymorphism detection tool. It detects homozygous SNPs and indels by aligning shotgun reads to the finished genome sequence." http://www.sanger.ac.uk/science/tools/ssahasnp
How can you call 'homozygous' SNPs from the finished genome sequence using shotgun reads? Shouldn't the finished genome be the consensus of those reads, and therefore homozygous SNPs would just be the base called in the reference?
Any ideas?
Thanks!
Hi John, Thanks for your reply.
Do you have an example of a "simple" SNP caller that you mentioned above, or would you just write a very simple one using some heuristics (e.g. call a SNP if it doesn't violate a binomial distribution with p=0.5)? I think I could do that fairly easily with python using the pysam module...
I understand your second point. However, what if you were using the same sequencing reads from the same single individual that a genome project is based on (e.g. CLINT for the chimpanzee genome)? Even if the trace archive had other individuals on it, wouldn't you get conflicting information if you input all of these reads (from Clint and some other individuals), so it wouldn't call a SNP? I'm sorry for these elementary questions, but the sshaSNP site doesn't have much info on it. I'm wondering if perhaps you (or others) have some experience with this tool.
Well if it's from the same individual, then I guess that throws out the possibility of a homozygous SNP, unless there's some technical artifact in de novo alignment that could lead to that (I know very little about de novo alignment beyond the absolute theoretical basics, so i can't say!).
But perhaps it's the terminology that's causing the problem here. The term "SNP" is very ambiguous, and some people will swear blind that it means X, while another chap is sure it means Y. Often X will be a senior tenured geneticist and will argue from authority, whilst Y will point to a text book or Wikipedia and argue from a different kind of authority. Truthfully, the term was coined before anyone really knew what was going on, and it doesn't hold a whole lot of value since it's so ambiguous.
Every base in your genome with a 1:1 mapping to a reference genome has the probability of being one of 4 states - A,C,G, and T. The reference base is usually the most commonly seen base in the population - but not always, as there are some human SNPs which occur above 50% frequency. In fact, the very idea of a reference genome doesn't really make much sense if it's just down to frequency - when the UK10K was first being established, I sat in some very long and boring conference calls where we discussed what we should do to prevent UK SNPs from dominating dbSNP and artificially inflating the allele calls, which might lead to the reference human genome becoming very UK-centric, which would obviously unfairly effect the mapping rates of DNA from less-established countries. (NB: This was 5-6 years ago before long NGS reads where a thing, in fact people thought they'd get smaller and more numerous).
So to some, a SNP is a single base variation compared to the reference which occurs with a significant frequency, like above 1%. To others, just 1 occurrence is enough to be called a SNP (although I would call that a variant, not a SNP). Others still will drop the "single base" qualifier and say it must be a "single unit", such as a multi-base deletion or even multiple bases that change together such as CTG to CAG (both code for Serine, so this is actually pretty common). Other's would call that two SNPs. Bioinformaticians often say SNPs are single-base whilst multi-base insertions and deletions are something totally different called "indels" -- however that is a fairly recent distinction created as a result of the software that returns these variations producing one file for the single-base variants, and another INDEL file for the insertion/deletion variants. An old-school geneticist might say such distinctions are meaningless, however they probably don't appreciate that two individuals could have the exact same genome, but their VCF files have different indels.
So depending on your definition, you can have a homozygous SNP having sequenced just 2 individuals from the entire species - however it might not be a SNP for long, after a 3rd individual gets sequenced. You see the issue.
So maybe in light of the above, this is a little clearer to you now. There's no such thing as conflicting information. There is only good sequencing and bad sequencing :)
EDIT: Oh and for a simple SNP caller you probably just want samtools mpileup or something, and parse the output for at least three reads that support the SNP and forget probabilities all together. You're not really calling SNPs - you're excluding regions that may be SNPs, but keeping all the sequencing errors. Essentially, you are trying to find (to exclude) regions where 100s of reads all start from an indel (like this: Strange Pattern in Bam File ) because that has a legitimate chance of b0rking the BQSR.
Wow, what a great reply. I wish I could give you more than just one upvote for that. Thanks for taking the time to answer my questions about this. I definitely understand the subtleties of BQSR and SNP calling in general much better after reading your comments.
John's reply IS great, but I take issue with one statement:
Not trying to be pedantic, but (as an old-school geneticist) the term itself is hardly ambiguous. It's an acronym for single-nucleotide polymorphism (one nucleotide, more than one "morph" or form). It's distinct from single-base insertions/deletions as well as multi-base substitutions. Just because some users are sloppy with the terminology doesn't change the definition.
Regarding alignment-based variant calling, it's a SNP if it differs from your reference genome, irrespective of its frequency in the general population. And the reference is determined by the individual(s) whose DNA was used for the reference genome assembly. John's point about a UK-centric reference genome is spot-on.
Also, the 1% threshold is arbitrary. For example, the SNP that causes sickle-cell anemia falls below that limit in an UK-centric genome but greatly exceeds it in West African populations.
You're absolutely right - and of course I agree with you in all the points you brought up. I should have made it more clear that it's ambiguous not by definition, but by general understanding, in that you can be in a room with several people who all are using the term SNP slightly differently. You and I actually (surprisingly) have the same understanding of what a SNP is - probably because we are both geneticists by training - however I would wager that it is unlikely that we make up the majority, given that there are so many polymorphisms of what SNP can mean to people (har har). I can point you too a number of official-looking places which says things like:
So personally, and this is where we may disagree, I feel like it's too late to set the record straight for SNPs. Too many erroneous blog posts and textbooks have been written. Contradictory definitions are too engrained in the culture and people get really fired up about their definition. I feel it's better to use the term SNP to mean "whatever genomic variant we were just talking about", or just not use it at all and stick to "variant". I know i'm technically just spreading misinformation when I do that, but pragmatically it protects me from making assumption-errors when talking to others. Particularly since I haven't been involved in Genetics for quite some time now...
I suppose I should end on a positive note. Well these guys claim there's little to no evolutionary difference between SNPs and small INDELs, and we should think of them as different means to the same end. I think that's a good take-home message for the OP - that despite the distinctions we make in our vocabulary, the cell doesn't care what we call it's work - it only cares about the outcome, and that should be our focus too if we ever hope to understand it :-)
Hi John, one more question if you could spare the time.
Looking at samtools mpileup, I can't seem to figure out how to set the sensitivity based on 'number of reads that support' a SNP. The only thing I see is a -m flag (minimum number of reads for indel candidates). However, there doesn't seem to be anything for SNP candidates. The only thing I see that will increase the sensitivity of the program is to turn off the probabilistic base alignment quality (BAQ) calculation using the -B flag.
Thanks again!
You can specify parameters for variant calls using bcftools (available with samtools) filter command, e.g. a minimum read depth of three is: