You already have the Harismendy et al. publication, which is probably the current reference on the topic. In what respect would you expect this to change for non-human samples? (DNA is DNA, RNA is RNA...)
Base/simple sequence composition, theta, haplotype structure, etc. all vary across species. Thus, to the extent that systematic NGS sequencing error is sequence dependent, it may be necessary to generate species/platform-specific error profiles when resequencing different species (especially diploids).
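Building such a profile is conceptually simple once reads are aligned: tally mismatch rates stratified by the local reference context. A minimal sketch (pure Python, no indel handling; the k-mer context window and the input format of pre-aligned `(start, sequence)` pairs are simplifying assumptions, not anyone's published method):

```python
from collections import defaultdict

def context_error_profile(ref, reads, k=3):
    """Tally per-context mismatch rates against a reference.

    ref   -- reference sequence (str)
    reads -- list of (start, sequence) tuples, assumed already
             aligned to ref with no indels (a simplification)
    k     -- size of the reference context window (odd)
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [mismatches, total]
    flank = k // 2
    for start, seq in reads:
        for i, base in enumerate(seq):
            pos = start + i
            if pos < flank or pos + flank >= len(ref):
                continue  # context would run off the reference
            context = ref[pos - flank : pos + flank + 1]
            counts[context][1] += 1
            if base != ref[pos]:
                counts[context][0] += 1
    return {c: m / t for c, (m, t) in counts.items() if t}

# toy example: one read carrying a single mismatch (A vs ref T)
profile = context_error_profile("ACGTACGT", [(0, "ACGAACGT")])
```

Run separately per species and platform, comparing the resulting dictionaries would show directly whether the error spectrum is context-dependent enough to warrant distinct profiles.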
(Warning: Shameless self-promotion ahead!) We looked into this for 454 to write a simulator for pyrosequencing data. If you read the paper, you'll find a bit of discussion about the various errors that occur, but to me the interesting (but sad) fact is that when we assemble simulated data sets, we get better results than from real data sets. Which means that we still don't know what kind of artifacts cause this. Some candidates are: uneven coverage/duplicate clones, chimeric sequences, PCR amplification errors. Any thoughts on this most welcome.
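For readers unfamiliar with pyrosequencing error: 454's dominant artifact is homopolymer-length miscalls, which is the kind of thing a simulator has to model. A toy sketch of that one error mode (this is an illustration, not the simulator from the paper; the 5% per-run error rate is an arbitrary placeholder):

```python
import random

def simulate_454_read(template, hp_error=0.05, rng=None):
    """Copy a template, but let each homopolymer run be lengthened
    or shortened by one base with probability hp_error -- the
    characteristic pyrosequencing error mode.
    """
    rng = rng or random.Random()
    out = []
    i = 0
    while i < len(template):
        base = template[i]
        j = i
        while j < len(template) and template[j] == base:
            j += 1                      # scan to end of homopolymer run
        run = j - i
        if rng.random() < hp_error:
            run += rng.choice((-1, 1))  # over- or under-call the run
        out.append(base * max(run, 0))
        i = j
    return "".join(out)
```

With `hp_error=0.0` the read equals the template, which makes the simulator easy to sanity-check; real error models would additionally make the miscall probability grow with run length.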
I'd also point to Chris Quince's work on metagenomics, he's looked quite a bit into how errors inflate the number of estimated taxa, and how to deal with that.
My feeling is that uneven coverage is the biggest problem with assembling real data (after base-calling errors, of course). Even with high-coverage Illumina data, I see many contig breaks due to regions of low or zero coverage. Chimeric sequences are less of a problem for me, but I work mainly on Illumina data.
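Those break points are easy to spot if you extract a per-base coverage track (e.g. from a pileup) and scan for runs below a threshold. A minimal sketch, where the `min_cov` cutoff of 3 is an arbitrary assumption:

```python
def low_coverage_gaps(coverage, min_cov=3):
    """Return half-open (start, end) intervals where per-base
    coverage drops below min_cov -- candidate contig-break points."""
    gaps, start = [], None
    for i, c in enumerate(coverage):
        if c < min_cov and start is None:
            start = i                    # gap opens
        elif c >= min_cov and start is not None:
            gaps.append((start, i))      # gap closes
            start = None
    if start is not None:                # gap runs to the end
        gaps.append((start, len(coverage)))
    return gaps

gaps = low_coverage_gaps([5, 6, 2, 1, 4, 0, 0, 7], min_cov=3)
```

Comparing the gap intervals against actual contig boundaries would quantify how much of the fragmentation is coverage-driven versus caused by other artifacts.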
I hope you're right! Our experience is mostly with 454, and there increasing coverage didn't seem to help a lot for our assemblies -- but this could simply be an issue with the assembly software. Or it could be 454-specific artifacts, or it could be that 454 libraries are more uniformly distributed for some reason.
Unfortunately, this work addresses the impact of NGS platforms on transcriptome coverage. They don't estimate NGS error rates, nor do they model error in their simulations:
"ESTcalc and the underlying simulations do not currently incorporate explicit models of sequencing and assembly errors...A next step will be to develop realistic models of error in sequencing and assembly, and to provide tools to allow any sets of assumptions about read length and cost to be examined."
What I'm looking for is empirical estimates to help with interpretation of genome-wide SNP/indel data.