5.1 years ago
el97004
Hi! I have noticed some differences in resulting assembly statistics from Abyss and BBMap stats.sh and was wondering if anyone knew why. For example, this is an output I get from Abyss:
n     n:500  L50  min  N80    N50    N20    E-size  max     sum      name
3854  1282   71   500  2119   17327  40129  23269   95498   4735231  unitigs.fa
3448  997    78   500  10022  27708  46504  31492   108468  6954249  contigs.fa
3367  945    70   500  12423  30013  61035  35301   108468  6952419  scaffolds.fa
If I take the scaffolds.fa file and run BBMap stats.sh on it: stats.sh in=scaffolds.fa
Here are the resulting values from BBMap:
Main genome scaffold total: 3367
Main genome contig total: 3391
Main genome scaffold sequence total: 7.679 MB
Main genome contig sequence total: 7.678 MB 0.020% gap
Main genome scaffold N/L50: 82/27.296 KB
Main genome contig N/L50: 88/24.476 KB
Main genome scaffold N/L90: 864/549
Main genome contig N/L90: 891/547
Max scaffold length: 108.468 KB
Max contig length: 108.468 KB
Number of scaffolds > 50 KB: 24
% main genome in scaffolds > 50 KB: 22.09%
Minimum  Number         Number         Total          Total          Scaffold
Scaffold of             of             Scaffold       Contig         Contig
Length   Scaffolds      Contigs        Length         Length         Coverage
-------- -------------- -------------- -------------- -------------- --------
All      3,367          3,391          7,679,180      7,677,680      99.98%
100      3,367          3,391          7,679,180      7,677,680      99.98%
250      2,878          2,902          7,594,310      7,592,810      99.98%
**500    945            969            6,953,926      6,952,429      99.98%**
1 KB     415            439            6,574,059      6,572,562      99.98%
2.5 KB   327            350            6,421,782      6,420,460      99.98%
5 KB     250            270            6,154,098      6,153,098      99.98%
10 KB    186            201            5,680,411      5,679,661      99.99%
25 KB    85             96             3,920,558      3,920,008      99.99%
50 KB    24             32             1,696,196      1,695,796      99.98%
100 KB   2              2              211,268        211,268        100.00%
As you can see, the contig and scaffold N50s/L50s are close but not identical. In addition, the total scaffold/contig lengths (for minimum scaffold length=500, Abyss uses a minimum of 500bp) are close but not identical. Has anyone seen this before and can shed some light?
Thank you.
ABySS uses a specific approach to calculate those stats (as pointed out here already).
There are a few issues on this topic on the abyss github repo, eg:
For searching, use the term "abyss-fac", as this is the tool/step from the ABySS pipeline that does the actual calculations.
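To illustrate why a minimum-length cutoff (such as the 500 bp mentioned in the question) can shift N50/L50 values, here is a minimal Python sketch. It is not the abyss-fac or stats.sh implementation; the function name `n50` and the toy lengths are made up for illustration, and it uses the common definition of N50 as the length at which the running sum of sorted lengths reaches half the total.

```python
def n50(lengths, min_length=0):
    """Return (N50 length, number of sequences needed to reach half the total).

    Sequences shorter than min_length are dropped before anything is computed,
    which changes both the total and the resulting N50.
    """
    kept = sorted((x for x in lengths if x >= min_length), reverse=True)
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0  # no sequences passed the filter


# Toy data (hypothetical, not from the thread): many short fragments plus a few long ones.
lengths = [100] * 30 + [600, 800, 1000, 1200]

print(n50(lengths))                  # all sequences included
print(n50(lengths, min_length=500))  # only sequences of at least 500 bp
```

With these toy values the unfiltered call gives a smaller N50 than the filtered one, because the short fragments inflate the total that the running sum has to reach.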
Keep in mind that, by default, BBMap stats.sh requires at least 10 consecutive Ns between two contigs to consider it a scaffold. Also, stats.sh is likely considering all contigs/scaffolds when calculating N/L50. Usually, contigs shorter than 250 or 500 bp are removed from draft assemblies, and I think you should not consider them when calculating assembly statistics.

Thanks for your reply, alex.zaccaron. I think I know how to solve the second item you mention (I will filter for scaffolds > 500 bp and rerun BBMap stats.sh). For the first item, do you know what I should change this value to in BBMap stats.sh so that it matches ABySS?

You can change the parameter n in stats.sh to adjust the number of contiguous Ns required to treat a sequence as a scaffold instead of a contig. For example, if you specify n=1, then stats.sh will "break" a contig at every single N. I am not sure what ABySS uses, but it has to be between 1 and 10. You could run stats.sh a few times with different values of n to see when it reports the same number of contigs as ABySS.

Thanks! I tried all values of n from 1 to 10 but unfortunately cannot get the same number of contigs as in ABySS (997); the closest I got was 970 at n=1.
Edit: I wonder if the contig statistics are off because I am using the scaffolds file as input to BBMap stats.sh. What if this is the only file that one has?
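For anyone following along, the effect of the n parameter described above can be sketched in a few lines of Python. This only mimics the behaviour as described in this thread (a scaffold is broken into contigs at any run of at least n consecutive Ns); it is not BBMap's actual code, and the function name `split_contigs` and the toy sequence are hypothetical.

```python
import re


def split_contigs(scaffold, min_gap):
    """Split a scaffold into contigs at runs of at least min_gap N characters.

    Runs of N shorter than min_gap stay inside the contig they occur in.
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [p for p in pieces if p]  # drop empty pieces from leading/trailing gaps


# Toy scaffold (made up): one gap of 3 Ns and one gap of 12 Ns.
scaffold = "ACGT" + "N" * 3 + "GGCC" + "N" * 12 + "TTAA"

for min_gap in (1, 5, 10, 15):
    print(min_gap, len(split_contigs(scaffold, min_gap)))
```

The contig count derived from a scaffolds file depends on this threshold, so it does not have to match the number of sequences in a separate contigs file.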