Is there a way to estimate how large a gap might be after performing 454 pyrosequencing followed by Newbler? I have several closed reference genomes, and I know that my read length is about 400bp with about 40x coverage. Therefore I have high confidence that these gaps are due to repeat regions.
Edit: I guess this might be answerable by knowing the repeat regions in the genome. How would I identify repeat regions from the sequence alone? If I knew the length L of a repeat region, I would estimate the gap as gapLength = L - (2 * 400).
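To make the idea in the edit concrete, here is a minimal sketch, assuming the rule of thumb gapLength = L - (2 * read length) and a naive exact-match k-mer count run on one of the closed reference genomes. The function names are hypothetical, only the forward strand is checked, and a real analysis would more likely use a self-alignment tool that tolerates mismatches.

```python
from collections import defaultdict


def estimate_gap_length(repeat_length, read_length=400):
    """Rule of thumb from the question: gap = L - 2 * read length, floored at 0."""
    return max(0, repeat_length - 2 * read_length)


def repeated_kmers(sequence, k):
    """Return {kmer: [start positions]} for k-mers seen more than once.

    Exact matches on the forward strand only; diverged or
    reverse-complemented repeat copies are not detected.
    """
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Toy example: "ACGT" occurs at positions 0 and 6.
print(repeated_kmers("ACGTTTACGT", 4))
# A hypothetical 1500 bp repeat with 400 bp reads.
print(estimate_gap_length(1500))
```

Choosing k close to the read length would highlight the repeats that a single read cannot span, which are the ones most likely to break the assembly.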
For the fire ant genome, we noticed dramatic improvements in Newbler performance when we increased parameter stringency to a minimum overlap of 100 bp between reads and 98 or 99% identity. The best explanation I can see is that the stricter parameters helped resolve repeats: many old repeats were then treated as unique sequences.
I think that even with 40x coverage, you're not guaranteed to have reads covering all gaps, and the theoretical models don't work so well in practice. I don't know any exact numbers for this (and it probably varies from run to run and lab to lab), but coverage tends to be uneven, and there could be features of the sequence that make some parts rare or unsequenceable. It's well known that you get duplicated clones (the same clone on multiple beads), which is one form of unevenness.
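For reference, the usual theoretical model here is Lander-Waterman, which assumes reads are sampled uniformly at random. A minimal sketch of what it predicts is below; the 5 Mb genome size is an assumption for illustration only, and the function name is hypothetical.

```python
import math


def expected_gaps(genome_size, read_length, coverage, min_overlap=0):
    """Lander-Waterman expected number of gaps (roughly, contigs).

    Assumes uniformly random read starts, which real 454 runs
    do not satisfy.
    """
    n_reads = coverage * genome_size / read_length
    theta = min_overlap / read_length  # fraction of a read needed to detect overlap
    return n_reads * math.exp(-coverage * (1 - theta))


# Assumed 5 Mb genome, 400 bp reads, 40x coverage, 100 bp minimum overlap.
print(expected_gaps(5_000_000, 400, 40, min_overlap=100))
```

Under these assumptions the model predicts far fewer than one gap, so any gaps you actually see come from things the model ignores: repeats, uneven coverage, and hard-to-sequence regions.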
Newbler takes care of these duplicates: it identifies them and does not remove them, but when it, for example, determines the consensus bases, the duplicates count as one read (the same goes for average read depth).
I assume you have shotgun reads only? For Newbler assemblies, you can actually find the repeats among the contigs by looking at the per-contig read depth. With apologies for the self-promotion, here is a paper describing just that: http://www.hindawi.com/journals/seq/2010/782465.html. Contigs with higher-than-normal read depth are collapsed repeats, and the depth is proportional to the copy number.
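The read-depth idea can be sketched as follows. This is only an illustration, not the method from the paper: it assumes you have already extracted a per-contig average depth, takes the median as the single-copy baseline (reasonable only if most contigs are unique), and uses a hypothetical cutoff of 1.5x.

```python
from statistics import median


def flag_collapsed_repeats(contig_depths, min_ratio=1.5):
    """Return {contig: estimated copy number} for high-depth contigs.

    contig_depths maps contig name -> average read depth.
    min_ratio is an assumed cutoff, not a Newbler setting.
    """
    baseline = median(contig_depths.values())  # assumed single-copy depth
    flagged = {}
    for name, depth in contig_depths.items():
        ratio = depth / baseline
        if ratio >= min_ratio:
            flagged[name] = round(ratio)
    return flagged


# Made-up depths for illustration.
depths = {
    "contig00001": 40.0,
    "contig00002": 38.0,
    "contig00003": 42.0,
    "contig00004": 81.0,
    "contig00005": 39.0,
}
print(flag_collapsed_repeats(depths))
```

In this toy example only contig00004 is flagged, at roughly twice the baseline depth, which would suggest a two-copy repeat.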
This will at least tell you which contigs are the repeats. Looking at the 454ContigGraph file could tell you which contigs are the neighbours of the repeats.
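Once you have the contig-to-contig links, looking up the neighbours is simple. The sketch below assumes the links have already been pulled out of the contig graph file as pairs of contig names; the parsing step is left out because the file layout is not shown here, and the function name is hypothetical.

```python
from collections import defaultdict


def neighbours_of(repeat_contigs, edges):
    """Map each repeat contig to the sorted list of contigs linked to it.

    edges is an iterable of (contig_a, contig_b) pairs, assumed to have
    been extracted from the contig graph beforehand.
    """
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Drop self-links so a tandem repeat does not list itself.
    return {c: sorted(adjacency[c] - {c}) for c in repeat_contigs}


# Made-up links for illustration.
edges = [("c1", "c4"), ("c4", "c2"), ("c3", "c4"), ("c2", "c5")]
print(neighbours_of(["c4"], edges))
```

A repeat contig with more than two neighbours is a good hint that several unique regions flank different copies of the same repeat.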
Is it a eukaryotic or bacterial genome?
bacterial genome