Hi there,
I have a very fragmented reference genome that I want to visualize (with JBrowse or similar).
This is the paper for the genome: link
The genome can be found here: link2
Here are the QUAST statistics:
########
QUAST Results
########
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
Assembly                      Abal.1_1
# contigs (>= 0 bp)           37192295
# contigs (>= 1000 bp)        1276678
# contigs (>= 5000 bp)        529013
# contigs (>= 10000 bp)       343016
# contigs (>= 25000 bp)       145508
# contigs (>= 50000 bp)       46234
Total length (>= 0 bp)        18167382048
Total length (>= 1000 bp)     13017811908
Total length (>= 5000 bp)     11361640463
Total length (>= 10000 bp)    10034318481
Total length (>= 25000 bp)    6872368770
Total length (>= 50000 bp)    3406852776
# contigs                     1887964
Largest contig                297427
Total length                  13450974050
GC (%)                        38.76
N50                           25814
N75                           9780
L50                           139726
L75                           348468
# N's per 100 kbp             1703.76
Any help with wrangling this beast into something presentable would be appreciated.
I tried a length cutoff to get rid of the contigs < 1000 bp, which helped, but is there a way to rescaffold or similar?
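For reference, the length cutoff mentioned above could look roughly like the sketch below. It is plain Python with no external libraries and streams the FASTA one record at a time, which matters with tens of millions of sequences. The function names `read_fasta` and `filter_fasta`, the default threshold and the line width are illustrative assumptions, not anything taken from the paper or from QUAST.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA handle, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(in_handle, out_handle, min_len=1000, width=60):
    """Write only records with at least min_len bases; return (kept, dropped)."""
    kept = dropped = 0
    for header, seq in read_fasta(in_handle):
        if len(seq) >= min_len:
            out_handle.write(f">{header}\n")
            # Re-wrap the sequence so output lines stay at a fixed width.
            for i in range(0, len(seq), width):
                out_handle.write(seq[i:i + width] + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Tiny in-memory demo: one 4 bp record and one 1200 bp record.
demo_in = io.StringIO(">short\nACGT\n>long\n" + "A" * 1200 + "\n")
demo_out = io.StringIO()
print(filter_fasta(demo_in, demo_out, min_len=1000))  # (1, 1)
```

On a real assembly the two `io.StringIO` objects would be replaced by file handles opened with `open(...)`; the file names are up to you.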
I know plant genomes are messed up, but more than 3 billion contigs longer than 50 kb? 18 billion contigs in total?? That would put an estimate of genome size in the order of 1e14 base pairs or more. Definitely not a great assembly, if I'm not missing something important.
Edit: the paper actually mentions 37 million scaffolds, for a total of 18 Gb, so maybe it would be better to start from there (that is, scaffolds instead of contigs)?
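A quick way to sanity-check the figures is to keep the two kinds of QUAST rows apart: the "# contigs" rows are sequence counts, while the "Total length" rows are summed base pairs. The sketch below only redoes that arithmetic with the numbers copied from the report above; the variable and function names are made up for illustration.

```python
# Numbers copied from the QUAST report above.
n_seqs_all = 37_192_295          # "# contigs (>= 0 bp)"
total_bp_all = 18_167_382_048    # "Total length (>= 0 bp)"
n_seqs_50kb = 46_234             # "# contigs (>= 50000 bp)"
total_bp_50kb = 3_406_852_776    # "Total length (>= 50000 bp)"


def mean_length(total_bp, n_seqs):
    """Average sequence length in bp for one size class."""
    return total_bp / n_seqs


print(f"all sequences: {n_seqs_all / 1e6:.1f} M sequences, "
      f"{total_bp_all / 1e9:.1f} Gb, "
      f"mean {mean_length(total_bp_all, n_seqs_all):.0f} bp")
print(f">= 50 kb:      {n_seqs_50kb} sequences, "
      f"{total_bp_50kb / 1e9:.1f} Gb, "
      f"mean {mean_length(total_bp_50kb, n_seqs_50kb) / 1e3:.0f} kb")
```

Read this way, the report gives about 37 million sequences totalling about 18 Gb, which lines up with the scaffold numbers quoted from the paper, and a mean length of roughly 490 bp over all sequences versus roughly 74 kb for the 46,234 sequences of 50 kb or more.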