Best Practice On Variant Discovery For Bacteria?
7
9
Entering edit mode
12.4 years ago
lh3 33k

I am mostly working with primate data, but I doubt if the best practice for such data is still the best for bacteria given higher divergence between bacterial strains. Here are two possible ways to do SNP/INDEL discovery for bacteria:

  1. Just like variant discovery for mammals, we map reads to the reference strain with a standard mapper and run GATK/samtools/in-house tools to find variants. If this is still the case, which mappers and SNP callers do you use? How do you change the default settings?

  2. Do de novo assembly first and then align contigs to the reference strain to find variants. If this is the case, which assembler (velvet, soapdenovo or abyss?) and aligner (mummer?) do you use?

  3. There much be other ways of analyzing data I do not know...

• 13k views
ADD COMMENT
1
Entering edit mode
  1. BWA/MAQ for sure. all the other aligner/mapper are having some sort of dishonest to the truth.(Sorry to say that, but that is what I found after my in house test). Even though, in some specical cases, bwa still has incorrect mappings. But it is the best in current(speed, space, and correctness) since maq is died. If you pay attention on read quality filtering and reference self similarity reduction, it would be greatly increasing your correctness. Dont let the trash reads mess up entire analysis.
  2. I dont recommend the assembly-snp detection line on Bacteria. If you have to, try to hash based assembler , like Mira. it will generate a consensus result. Velvet with its de-brujin Graph datastructure, could loss some important informations like discarding the SNPs.
  3. Sorry I dont used any others and I dont know either. Try apply math into it rather than others tools. because they have all some sort of included their own pre-assumtpions already which you dont know exactly. (especially skip those fancy GUI designed ones)
ADD REPLY
1
Entering edit mode

MIRA will also give you strain sequences, use convert_project for that

ADD REPLY
2
Entering edit mode
12.4 years ago

I work on Breseq, http://code.google.com/p/breseq/, and would suggest you give it a try. It'll map with SSAHA2.

The big name mappers are BWA and Bowtie. BWA + samtools is fine for SNP and small( < 5bp) INDELS.

ADD COMMENT
0
Entering edit mode

But breseq doesn't support SNPcalling between sequenced case and control strains

ADD REPLY
2
Entering edit mode
11.9 years ago

This is a bit late, I postponed replying to this til our latest paper came out, and then I couldn't find this question anymore. Anyway - this may be of interest: we just published this paper

High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts673

http://bioinformatics.oxfordjournals.org/content/early/2012/11/19/bioinformatics.bts673.full.pdf+html

which shows how you can use direct de novo assembly of multiple samples to discover variants. Since bacterial genomes tend to be sequenced to higher depth than vertebrates, and since the genomes are relatively unrepetitive, assembly has comparable power to mapping, but has better specificity for indels and structural variants. Cortex now has a wrapper/framework that allows you to call at multiple kmers. The paper gives a couple of examples, reproducing the results (or comparably good results) for two published studies, one with long (Sanger) reads and one with Illumina reads.

ADD COMMENT
0
Entering edit mode

It is not true that bacteria are "relatively unrepetitive". There are a lot of species which are full of repeats such as copies of insertion sequences etc. For example, Mycobacterium ulcerans has >200 copies of IS2404 and >600 copies of IS2606, each of which is >1500bp. Tandem repeats are common too eg. CRISPR. Sure, they're not as bad as ALU repeats in human etc, but the blanket statement that they aren't repetitive isn't quite true :)

ADD REPLY
0
Entering edit mode

FIne, I stand somewhat corrected. The M ulcerans example is quite surprising to me - thats 20% of its genome! However, I do think on a global scale, where we have trees, wheat, barley, vertebrates, lamprey, lungfish, and bacteria, bacteria are at the less repetivie end of the spectrum.

ADD REPLY
0
Entering edit mode

I sound very petty here! Sorry Torsten.

ADD REPLY
2
Entering edit mode
11.9 years ago
Raygozak ★ 1.4k

Another important tool for haploid variant calling is freebayes (http://bioinformatics.bc.edu/marthlab/FreeBayes), it actually is built to support any ploidy from your organism but makes it easier than samtools mpileup - bcftools, since samtools is tailored for diploid organisms, and while you may be able to process it further, freebayes works like a charm.

ADD COMMENT
0
Entering edit mode

I have explained several times: to call from a single sample, a diploid caller is the way to go. This strategy is always better. I have not seen a single counter evidence so far.

ADD REPLY
1
Entering edit mode

Could you please address me to longer explanation you might write some time ago? Thanks a lot.

ADD REPLY
1
Entering edit mode

If a diploid caller is better than a haploid caller from single-sample haploid data, would a polyploid caller not be even better?

ADD REPLY
1
Entering edit mode
12.4 years ago

I use bwa/samtools, it seems to work fine, and I use velvet, or sometims edena, when I want to make a de novo assembly, but with the read coverage I typically get, it yields a whole lot of pieces.

GATK requires a list of known variants, which is kind of weird when your goal is to look for variants you don't know about.

ADD COMMENT
0
Entering edit mode
12.4 years ago
liz.batty ▴ 30

For reference-based variant calling, we map with stampy without bwa pre-mapping and the substitution rate set to 0.01, and call variants using an in-house mix of samtools, picard, and GATK. We assemble with velvet, and sometimes order the contigs into a new reference and map to that with the same tools as the standard references, or we've used Cortex to do de-novo variant calling with some succes to get SNPs and short indels.

ADD COMMENT
0
Entering edit mode
11.9 years ago
rst ▴ 20

If you go the de novo route I would suggest looking at EDENA as an assembler. The false positive rates, at least in my experience are greatly reduced which simplifies things downstream. I have noticed that it does require slightly higher coverage >100-150x to work really really well which usually isn't a problem when resequencing prokaryotes.

ADD COMMENT
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

I used HaploTypecaller but my plasmid reads are not present in the output vcf file ... but it's there in my input file

ADD REPLY

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6