Does anyone know of a script that calculates sequence statistics such as N50, number of contigs, contig lengths, etc. for multiple bins? So far, the GenomeTools seqstat script only calculates these for one bin at a time.
The BBMap package has a tool, "stats.sh", for calculating these statistics on an individual FASTA file. It also has another tool, "statswrapper.sh", that calculates the statistics for multiple FASTA files and outputs this information (assembly size, N50, L50, number of contigs, GC%, etc.) as one tab-delimited line per file. I actually wrote it for comparing different assemblies of the same data, but it works for this purpose as well.
https://github.com/sanger-pathogens/assembly-stats
Example
$ assembly-stats Pf3D7_v3.fasta
stats for Pf3D7_v3.fasta
sum = 23328019, n = 16, ave = 1458001.19, largest = 3291936
N50 = 1687656, n = 5
N60 = 1472805, n = 7
N70 = 1445207, n = 8
N80 = 1343557, n = 10
N90 = 1067971, n = 12
N100 = 5967, n = 16
N_count = 0
Gaps = 0
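If you would rather not install anything, the core numbers above are simple enough to compute yourself. Below is a minimal Python sketch, not the implementation used by assembly-stats or BBMap, that prints one tab-delimited line per FASTA file. The function names (contig_lengths, nx, summarize, print_table) and the column names are my own, and it assumes plain uncompressed FASTA. It follows the convention in the output above, where N50 is a length in bp and n is the number of contigs needed to reach it; some tools label these the other way round.

```python
from pathlib import Path

COLUMNS = ["file", "total_bp", "contigs", "largest", "n50_length", "n50_count"]


def contig_lengths(fasta_path):
    """Return the length of every sequence in an uncompressed FASTA file."""
    lengths = []
    current = None  # stays None until the first header line is seen
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx(lengths, fraction=0.5):
    """Walk the contigs from longest to shortest and return (length, count)
    at the point where the running sum reaches `fraction` of the total."""
    target = sum(lengths) * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return 0, 0  # no sequences in the file


def summarize(fasta_path):
    """Collect the basic statistics for one FASTA file into a dict."""
    lengths = contig_lengths(fasta_path)
    n50_length, n50_count = nx(lengths, 0.5)
    return {
        "file": Path(fasta_path).name,
        "total_bp": sum(lengths),
        "contigs": len(lengths),
        "largest": max(lengths, default=0),
        "n50_length": n50_length,
        "n50_count": n50_count,
    }


def print_table(paths):
    """Print a header line, then one tab-delimited line per FASTA file."""
    print("\t".join(COLUMNS))
    for path in paths:
        row = summarize(path)
        print("\t".join(str(row[c]) for c in COLUMNS))
```

To run it over a folder of bins, call something like print_table(sorted(glob.glob("bins/*.fa"))) after importing glob; the "bins/*.fa" pattern is only a placeholder for wherever your bins live.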
Perfect! format=5 did the trick for me, thanks!
You're welcome! Just note that I use "N50" to describe a number of contigs (since Number starts with N) and "L50" to describe a length in bp (since Length starts with L). Some programs reverse that nomenclature for reasons that are opaque to me.