I'm interested in searching for mutations associated with an altered phenotype in a bacterium via resequencing (probably Illumina). This particular bacterial genome is ~7Mb and there is a reference available. I figure I should aim for single-nucleotide resolution so that I can detect nearly 100% of SNPs. My question is, how can I determine the amount of coverage necessary to detect 100% of SNPs? I found a reference from Holt et al. 2009 in Bioinformatics where they state they can detect 80% of SNPs at 45X coverage (http://bioinformatics.oxfordjournals.org/content/25/16/2074.full).
A paper that spells it out would be best, but if that isn't available do you think I could use Lander-Waterman and the error rate associated with Illumina to estimate the necessary coverage?
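As a rough illustration of the Lander-Waterman idea, here is a minimal sketch that treats per-site read depth as Poisson-distributed with mean equal to the average coverage, and asks what fraction of sites would reach some minimum depth. The function names and the depth threshold of 8 are assumptions made for illustration, not values from Holt et al.; the model also ignores sequencing error, GC bias, and ambiguous mapping, so it gives an optimistic upper bound rather than a real detection rate.

```python
import math


def frac_sites_with_min_depth(mean_coverage, min_depth):
    """P(depth >= min_depth) when per-site depth ~ Poisson(mean_coverage)."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
        for i in range(min_depth)
    )
    return 1.0 - below


def reads_needed(genome_size, read_length, mean_coverage):
    """Number of reads giving the requested mean coverage (c = N * L / G)."""
    return math.ceil(mean_coverage * genome_size / read_length)


if __name__ == "__main__":
    genome_size = 7_000_000  # ~7 Mb genome from the question
    read_length = 100        # assumed read length
    min_depth = 8            # hypothetical depth needed to trust a SNP call
    for cov in (10, 20, 30, 45):
        frac = frac_sites_with_min_depth(cov, min_depth)
        n_reads = reads_needed(genome_size, read_length, cov)
        print(f"{cov}X: {n_reads} reads, "
              f"fraction of sites with depth >= {min_depth}: {frac:.6f}")
```

Under this idealised model the shortfall from 100% at moderate coverage comes almost entirely from the depth threshold you choose, which is why repeats and mappability matter more in practice than raw coverage.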
In the referenced paper they used pooled sampling and a GA_I sequencer.
If you use non-pooled samples and a HiSeq I'm pretty sure you should achieve quite good coverage (probably not 100%, but maybe you can reach 99%). A simple exercise is to take your reference genome and see if every 100bp window (or whatever read length you'll use) is uniquely mappable. This is quite easy to do since your reference is only 7Mb, and it will give you an idea of what read length you need to map all reads (and whether you need paired-end).
Thank you Pablo for the suggestion. I think I'll chop up the reference genome in to various lengths, resample with replacement up to various levels of coverage, and map the pieces back and see what fraction are unique.
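A simplified version of that exercise can be sketched without a mapper: enumerate every window of a given length in the reference and count how many occur exactly once, treating a sequence and its reverse complement as the same. This is only a proxy under stated assumptions. It uses exact matching (a real aligner tolerates mismatches), it enumerates all windows rather than resampling to a target coverage, and the function names are made up for illustration. Holding every 100bp window of a 7Mb genome in memory as a string is also heavy, so treat this as a sketch rather than something tuned.

```python
from collections import Counter

# Translation table for reverse-complementing; other characters (e.g. N) pass through.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def unique_fraction(sequence, k):
    """Fraction of length-k windows whose canonical sequence occurs exactly once."""
    sequence = sequence.upper()
    n_windows = len(sequence) - k + 1
    if n_windows <= 0:
        raise ValueError("k is longer than the sequence")
    counts = Counter(canonical(sequence[i:i + k]) for i in range(n_windows))
    # Each canonical k-mer seen once corresponds to exactly one window position.
    n_unique = sum(1 for c in counts.values() if c == 1)
    return n_unique / n_windows


if __name__ == "__main__":
    toy_reference = "AAACCCGGGTTTACGTACGTAC"  # stand-in for a real reference sequence
    for k in (3, 4, 6):
        print(k, unique_fraction(toy_reference, k))
```

Running this over a range of window lengths on the real reference would show roughly where the unique fraction levels off, which is the read length beyond which longer reads stop helping and paired ends become the better lever.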
Personally I think you will never be able to generate enough coverage to get 100% of the SNPs (that's why 80% is often referenced). The sequencing technique by itself already has enough difficulty getting through the hard/repetitive regions anyway.
Did you also consider structural variations? For that, paired-end libraries should be beneficial.
ALchEmiXt, I think structural variations like CNVs might be too difficult to detect without paired ends and with reads on the shorter end of the spectrum. We haven't made our final choice of platform and I'm prepping for the "worst case." But if we can use 100 bp paired ends, then I'll definitely look for CNVs.