You already have the Harismendy et al. publication, which is probably the current reference on the topic. In what respect would you expect this to change for non-human samples? (DNA is DNA, RNA is RNA...)
Base/simple sequence composition, theta, haplotype structure, etc. all vary across species. Thus, to the extent that systematic NGS sequencing error is sequence dependent, it may be necessary to generate species/platform-specific error profiles when resequencing different species (especially diploids).
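Building such a profile is conceptually simple once reads are aligned: tally mismatch rates stratified by the local reference context. A minimal sketch (pure Python, no indel handling; the k-mer context window and the input format of pre-aligned `(start, sequence)` pairs are simplifying assumptions, not anyone's published method):

```python
from collections import defaultdict

def context_error_profile(ref, reads, k=3):
    """Tally per-context mismatch rates against a reference.

    ref   -- reference sequence (str)
    reads -- list of (start, sequence) tuples, assumed already
             aligned to ref with no indels (a simplification)
    k     -- size of the reference context window (odd)
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [mismatches, total]
    flank = k // 2
    for start, seq in reads:
        for i, base in enumerate(seq):
            pos = start + i
            if pos < flank or pos + flank >= len(ref):
                continue  # context would run off the reference
            context = ref[pos - flank : pos + flank + 1]
            counts[context][1] += 1
            if base != ref[pos]:
                counts[context][0] += 1
    return {c: m / t for c, (m, t) in counts.items() if t}

# toy example: one read carrying a single mismatch (A vs ref T)
profile = context_error_profile("ACGTACGT", [(0, "ACGAACGT")])
```

Run separately per species and platform, comparing the resulting dictionaries would show directly whether the error spectrum is context-dependent enough to warrant distinct profiles.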
(Warning: Shameless self-promotion ahead!) We looked into this for 454 to write a simulator for pyrosequencing data. If you read the paper, you'll find a bit of discussion about the various errors that occur, but to me the interesting (but sad) fact is that when we assemble simulated data sets, we get better results than from real data sets. Which means that we still don't know what kind of artifacts cause this. Some candidates are: uneven coverage/duplicate clones, chimeric sequences, PCR amplification errors. Any thoughts on this most welcome.
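For readers unfamiliar with pyrosequencing error: 454's dominant artifact is homopolymer-length miscalls, which is the kind of thing a simulator has to model. A toy sketch of that one error mode (this is an illustration, not the simulator from the paper; the 5% per-run error rate is an arbitrary placeholder):

```python
import random

def simulate_454_read(template, hp_error=0.05, rng=None):
    """Copy a template, but let each homopolymer run be lengthened
    or shortened by one base with probability hp_error -- the
    characteristic pyrosequencing error mode.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(template):
        base = template[i]
        j = i
        while j < len(template) and template[j] == base:
            j += 1                      # scan to end of homopolymer run
        run = j - i
        if rng.random() < hp_error:
            run += rng.choice((-1, 1))  # over- or under-call the run
        out.append(base * max(run, 0))
        i = j
    return "".join(out)
```

With `hp_error=0.0` the read equals the template, which makes the simulator easy to sanity-check; real error models would additionally make the miscall probability grow with run length.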
I'd also point to Chris Quince's work on metagenomics, he's looked quite a bit into how errors inflate the number of estimated taxa, and how to deal with that.
My feeling is that uneven coverage is the biggest problem with assembling real data (after base-calling errors, of course). Even with high-coverage Illumina data, I see many contig breaks due to regions of low or zero coverage. Chimeric sequences are less of a problem for me, but I work mainly on Illumina data.
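Those break points are easy to spot if you extract a per-base coverage track (e.g. from a pileup) and scan for runs below a threshold. A minimal sketch, where the `min_cov` cutoff of 3 is an arbitrary assumption:

```python
def low_coverage_gaps(coverage, min_cov=3):
    """Return half-open (start, end) intervals where per-base
    coverage drops below min_cov -- candidate contig-break points."""
    gaps, start = [], None
    for i, c in enumerate(coverage):
        if c < min_cov and start is None:
            start = i                    # gap opens
        elif c >= min_cov and start is not None:
            gaps.append((start, i))      # gap closes
            start = None
    if start is not None:                # gap runs to the end
        gaps.append((start, len(coverage)))
    return gaps

gaps = low_coverage_gaps([5, 6, 2, 1, 4, 0, 0, 7], min_cov=3)
```

Comparing the gap intervals against actual contig boundaries would quantify how much of the fragmentation is coverage-driven versus caused by other artifacts.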
I hope you're right! Our experience is mostly with 454, and there increasing coverage didn't seem to help a lot for our assemblies -- but this could simply be an issue with the assembly software. Or it could be 454-specific artifacts, or it could be that 454 libraries are more uniformly distributed for some reason.
Unfortunately, this work addresses the impact of NGS platforms on transcriptome coverage. They don't estimate NGS error rates, nor do they model error in their simulations:
"ESTcalc and the underlying simulations do not currently incorporate explicit models of sequencing and assembly errors...A next step will be to develop realistic models of error in sequencing and assembly, and to provide tools to allow any sets of assumptions about read length and cost to be examined."
What I'm looking for is empirical estimates to help with interpretation of genome-wide SNP/indel data.