I see that there are many tools for filling gaps in a reference genome. I understand why this is a useful step in order to have as much contiguous sequence as possible. However, one thing I have been wondering about, which I haven't seen discussed anywhere, is whether this has the downside of introducing duplicate sequence into the reference, particularly for a highly fragmented genome, and whether there are any tools for dealing with this. For example, I am working with a genome with ~60,000 contigs which together sum to roughly the expected size of the genome. These have been scaffolded into ~45,000 scaffolds. I am wondering whether it makes sense to perform gap filling at this stage, because, if all of the genome is contained within the set of scaffolds + unscaffolded contigs, filling in gaps will generate sequence which is already contained within contigs that could not be scaffolded. My intuition is that this would actually worsen the quality of read mapping, as reads could then map ambiguously either to the gap-filled sequence or to a contig which simply did not get scaffolded but truly belongs in a scaffold gap. Can anyone comment on whether this is likely to be a problem, and whether there are common solutions to it?
To clarify, the contigs are from a PacBio assembly, and these have already been scaffolded using mate pair links. My question is about gap filling using mate pair sequence, i.e. building consensus sequence for the gap based on the 'overhang' read from a mate pair in which one read maps within a contig within a scaffold and the other is expected to fall within the gap. The conversion of Ns to nucleotide sequence is going to create sequence which already exists within the contigs, but just isn't in the scaffold already, no?
Okay. I never did it this way, but used contigs assembled from short PE fragments, which I then scaffolded using long reads (or formerly long MP fragments).
Given the context above, I'm not sure I understand your question here. Let me try to answer what I think I understand: if your assembly can't be scaffolded better with long reads, chances are the gaps in the sequence are too long to bridge with PacBio reads. In that case, your MP library requires enormous coverage to reliably fill the region. The insert size distribution of MP libraries usually varies quite widely, so you can't just estimate the position in the gap based on the distance, but need to stack overlapping reads.
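To put rough numbers on the insert size point, here is a minimal sketch. The function name, the orientation convention (insert measured from the start of the anchored read to the far end of its mate) and the library values (5 kb mean, 500 bp standard deviation, 100 bp reads) are all assumptions for illustration, not values from your data.

```python
def placement_window(anchor_start, insert_mean, insert_sd, read_len, n_sd=2.0):
    """Estimate where the unanchored mate could start, in scaffold coordinates.

    Assumes the insert size is measured from the start of the anchored read
    to the far end of its mate (hypothetical convention for this sketch).
    Returns (lo, hi), covering n_sd standard deviations either side of the
    expected start position.
    """
    expected_start = anchor_start + insert_mean - read_len
    spread = n_sd * insert_sd
    return expected_start - spread, expected_start + spread


# Hypothetical mate pair library: 5 kb inserts, 500 bp SD, 100 bp reads.
read_len = 100
lo, hi = placement_window(anchor_start=1000, insert_mean=5000,
                          insert_sd=500, read_len=read_len)
window = hi - lo
print(f"mate start is somewhere in [{lo:.0f}, {hi:.0f}]")
print(f"window is {window / read_len:.0f}x the read length")
```

With these assumed values the window is 2,000 bp wide, i.e. twenty times the read length, so the distance alone cannot place a read inside the gap; the reads have to overlap each other to build a consensus.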
However, I guess there's repetitive sequence in the gap (that's most likely why the gap is there). In that case, short reads will create ambiguous connections between the fragments. Imagine aligning multiple reads consisting only of AT repeats across an AT region longer than a read: each read fits equally well at many offsets.
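The AT example can be made concrete with a toy sketch. The sequences and the helper name below are made up for illustration, and exact string matching stands in for a real aligner.

```python
def placements(read, reference):
    """Return every start position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Step one base forward so overlapping matches are found too.
        start = reference.find(read, start + 1)
    return hits


# Toy region: a 40 bp AT stretch with unique flanks on both sides.
region = "GCGC" + "AT" * 20 + "CGCG"

at_read = "AT" * 5            # 10 bp read lying entirely inside the repeat
flank_read = "GCGCATATAT"     # 10 bp read anchored in the unique left flank

print(len(placements(at_read, region)), "placements for the AT-only read")
print(placements(flank_read, region), "placement(s) for the flank-anchored read")
```

The AT-only read has 16 equally good placements in this toy region, while the read that reaches into the unique flank has exactly one. Only reads anchored in unique sequence constrain the consensus, and in a repeat longer than a read, none of the interior reads are.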