Evaluation Of High Throughput Sequencing Error Rates?
14.3 years ago

Can anyone recommend papers evaluating/comparing the rate and type of errors for different high throughput sequencing platforms?

I've been able to find one major paper on humans (http://genomebiology.com/content/10/3/R32), but was hoping to know how these results generalize to other species.

Many thanks, Casey

next-gen sequencing papers error • 7.5k views

You already have the Harismendy et al. publication, which is probably the current reference on the topic. In what respect would you expect this to change for non-human samples? (DNA is DNA, RNA is RNA...)


Base/simple sequence composition, theta, haplotype structure, etc. all vary across species. Thus, to the extent that systematic NGS sequencing error is sequence dependent, it may be necessary to generate species/platform-specific error profiles when resequencing different species (especially diploids).
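
As a rough illustration of what a species/platform-specific error profile could look like, here is a minimal Python sketch that tallies substitution rates per reference base from read/reference pairs. It assumes the pairs are already aligned, equal in length, and gap-free, so indels are not counted; the function name `substitution_profile` and the input format are made up for this example and do not come from any particular tool.

```python
from collections import Counter

def substitution_profile(pairs):
    """Tally substitution rates per reference base.

    `pairs` is an iterable of (reference, read) strings that are assumed
    to be pre-aligned, equal in length, and gap-free (hypothetical input
    format for this sketch). Positions where either base is not A/C/G/T
    are skipped.
    """
    totals = Counter()   # reference base -> number of positions compared
    errors = Counter()   # (reference base, read base) -> mismatch count
    for ref, read in pairs:
        if len(ref) != len(read):
            raise ValueError("reference and read must have equal length")
        for r, q in zip(ref.upper(), read.upper()):
            if r not in "ACGT" or q not in "ACGT":
                continue
            totals[r] += 1
            if r != q:
                errors[(r, q)] += 1
    # Rate of each substitution, relative to how often the reference base was seen.
    rates = {sub: n / totals[sub[0]] for sub, n in errors.items()}
    return rates, totals

if __name__ == "__main__":
    rates, totals = substitution_profile([("ACGTACGT", "ACGTACGA"), ("AAAA", "AAAC")])
    for (ref_base, read_base), rate in sorted(rates.items()):
        print(f"{ref_base}>{read_base}: {rate:.3f} (n={totals[ref_base]})")
```

Comparing such tables between species sequenced on the same platform would be one way to check whether the error profile really is sequence dependent.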

14.1 years ago
Ketil 4.1k

(Warning: Shameless self-promotion ahead!) We looked into this for 454 to write a simulator for pyrosequencing data. If you read the paper, you'll find a bit of discussion about the various errors that occur, but to me the interesting (but sad) fact is that when we assemble simulated data sets, we get better results than from real data sets. Which means that we still don't know the kind of artifacts that cause this. Some candidates are: uneven coverage/duplicate clones, chimeric sequences, PCR amplification errors. Any thoughts on this are most welcome.

I'd also point to Chris Quince's work on metagenomics; he's looked quite a bit into how errors inflate the number of estimated taxa, and how to deal with that.


My feeling is that uneven coverage is the biggest problem with assembling real data (after base-calling errors of course). Even with high coverage Illumina data, I see many contig breaks due to a lack of/low coverage. Chimeric sequences are less of a problem for me but I work mainly on Illumina data.
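
To make the coverage point concrete, a minimal Python sketch for flagging low-coverage stretches where contig breaks might be expected. It assumes you already have a per-base depth list for a reference or contig; the function name `low_coverage_regions` and the `min_depth` threshold of 5 are arbitrary choices for illustration, not a recommendation.

```python
def low_coverage_regions(depths, min_depth=5):
    """Return half-open (start, end) intervals where depth < min_depth.

    `depths` is a list of per-base read depths (hypothetical input for
    this sketch); `min_depth` is an arbitrary illustrative threshold.
    """
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos          # open a new low-coverage run
        elif start is not None:
            regions.append((start, pos))  # close the current run
            start = None
    if start is not None:
        regions.append((start, len(depths)))  # run extends to the end
    return regions

if __name__ == "__main__":
    print(low_coverage_regions([10, 10, 2, 1, 10, 0], min_depth=5))
```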


I hope you're right! Our experience is mostly with 454, and there increasing coverage didn't seem to help a lot for our assemblies -- but this could simply be an issue with the assembly software. Or it could be 454-specific artifacts, or it could be that 454 libraries are more uniformly distributed for some reason.

14.3 years ago

This paper published in BMC Genomics might be of interest to you.


Thanks Lars -

Unfortunately, this work addresses the impact of NGS platforms on transcriptome coverage. They don't estimate NGS error rates, nor do they model error in their simulations:

"ESTcalc and the underlying simulations do not currently incorporate explicit models of sequencing and assembly errors...A next step will be to develop realistic models of error in sequencing and assembly, and to provide tools to allow any sets of assumptions about read length and cost to be examined."

What I'm looking for is empirical estimates to help with interpretation of genome-wide SNP/indel data.
