Hi everyone,
INTRO: I have a genome assembly obtained from linked reads (10X Genomics) + ONT long reads. The initial assembly was done in Supernova; gap filling and scaffolding were done in PBJelly. The work was outsourced, and the company did not provide any information about the gaps.
PROBLEM: The size of gaps (Ns) in the assembly varies from 10 to 100,000. For the submission I need to specify how the gap sizes were estimated, and which number stands for an unknown gap size.
QUESTIONS:
- Is there any general procedure/rule for estimation of gap size during this kind of assembly?
- I am especially wondering about gaps with round numbers like 10, 100, 5000, 100000, etc. What do these stand for? Do they represent a known size, or do they stand for an unknown gap size?
NOTE: Asking the company is not the best option, as this analysis was outsourced three years ago and the company does not communicate very smoothly these days.
Thanks a lot in advance, Milos
Where are you submitting to? ENA? NCBI?
Most commonly, the standard is to use a stretch of 100 Ns for gaps of unknown size, and the actual gap size for the ones that can be estimated.
Gap size estimates are usually made using paired-end/mate-pair read data (or read length) ...
Thank you.
I am submitting it to DDBJ. Generally, I understand that gaps of unknown size should be 100 Ns. I guess that the company providing NGS services should be aware of that as well. But does it mean that all other gaps are of known size? Difficult to say, right?! For example, there are 15,366 gaps of 10 Ns and 131 gaps of 5,000 Ns in the assembly. Isn't there any general rule for how gap size is treated in PBJelly?
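For anyone who wants to tabulate gap lengths like this in their own assembly, here is a minimal Python sketch. It only counts runs of N per length; it does not tell you whether a given length is an estimate or a placeholder. Treating a round length that occurs very often as a likely placeholder is just a heuristic on my side, not a rule of any tool. The function name and the example sequences are made up for illustration.

```python
import re
from collections import Counter


def gap_length_counts(fasta_text):
    """Return a Counter mapping N-run length -> number of such runs.

    Runs are counted per record, so a gap never spans two sequences.
    Line breaks inside a record are removed before searching.
    """
    counts = Counter()
    seq_parts = []

    def flush():
        # Join the wrapped lines of the current record and count N runs.
        seq = "".join(seq_parts)
        for match in re.finditer(r"[Nn]+", seq):
            counts[len(match.group())] += 1
        seq_parts.clear()

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            flush()  # a header line closes the previous record
        elif line:
            seq_parts.append(line)
    flush()  # close the last record
    return counts


# Toy input (hypothetical): one gap of 10 and one of 5 in scaffold_1,
# two gaps of 10 in scaffold_2.
example = (
    ">scaffold_1\nACGT" + "N" * 10 + "ACGT\n" + "N" * 5 + "ACGT\n"
    ">scaffold_2\n" + "N" * 10 + "ACGT" + "N" * 10 + "\n"
)

for length, n in sorted(gap_length_counts(example).items()):
    print(f"{length}\t{n}")
```

On a real assembly you would pass the file contents instead of the toy string, e.g. `gap_length_counts(open("assembly.fasta").read())` (the file name is a placeholder). This reads the whole file into memory, which is fine for a sketch but may need streaming for very large genomes.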
I don't know the specifics of PBJelly, but what is key to report are the gaps introduced by scaffolding your sequences. There the gap sizes are usually actual estimates. The gaps of 10 Ns are likely within contigs, and those are less of an issue (the gap estimate is more accurate and has only a small effect on the overall genome structure/size).