Does anyone know any tools for calculating assembly statistics such as N50, L50,assembly size, number of contigs/scaffolds and GC%? Thanks in advance!
Does anyone know any tools for calculating assembly statistics such as N50, L50,assembly size, number of contigs/scaffolds and GC%? Thanks in advance!
From the GAAS toolkit you can use a simple perl script:
conda install -c bioconda gaas
gaas_fasta_statistics.pl -f genome.fa
You get that kind of output:
--------------------------------------------------------------------------------
| Arabidopsis_thaliana.TAIR10.dna.toplevel.fa |
| Analysis launched the 04/29/2020 at 09h32m36s |
|------------------------------------------------------------------------------|
| Nb of sequences | 7 |
|------------------------------------------------------------------------------|
| Nb of sequences >1kb | 7 |
|------------------------------------------------------------------------------|
| Nb of sequences >10kb | 7 |
|------------------------------------------------------------------------------|
| Nb of nucleotides (counting Ns) | 119667750 |
|------------------------------------------------------------------------------|
| Nb of nucleotides U | 0 |
|------------------------------------------------------------------------------|
| Nb of sequences with U nucleotides | 0 |
|------------------------------------------------------------------------------|
| Nb of IUPAC nucleotides | 469 |
|------------------------------------------------------------------------------|
| Nb of sequences with IUPAC nucleotides | 4 |
|------------------------------------------------------------------------------|
| Nb of Ns | 185738 |
|------------------------------------------------------------------------------|
| Nb of internal N-regions (possibly links between contigs)| 159 |
|------------------------------------------------------------------------------|
| Nb of long internal N-regions >10000 | |
| /!\ This is problematic for Genemark | 4 |
|------------------------------------------------------------------------------|
| Nb of pure (only) N sequences | 0 |
|------------------------------------------------------------------------------|
| Nb of sequences that begin or end with Ns | 3 |
|------------------------------------------------------------------------------|
| GC-content (%) | 36.0 |
|------------------------------------------------------------------------------|
| GC-content not counting Ns(%) | 36.1 |
|------------------------------------------------------------------------------|
| Nb of sequences with lowercase nucleotides | 0 |
|------------------------------------------------------------------------------|
| Nb of lowercase nucleotides | 0 |
|------------------------------------------------------------------------------|
| N50 | 23459830 |
|------------------------------------------------------------------------------|
| L50 | 3 |
|------------------------------------------------------------------------------|
| N90 | 18585056 |
|------------------------------------------------------------------------------|
| L90 | 5 |
|------------------------------------------------------------------------------|
This result is saved in the <result> directory along with plots in <pdf> format.
You should have a look at SQUAT
Try CheckM. Notice that their strain heterogeneity estimation might be a bit confusing ( for me it is counter intuitive).
I'm a big fan of stats.sh which is from Brian Bushnells' bbtools package, installable via bioconda.
Useful for long read FASTQ length assessments and assembly contig and scaffold stats.
stats.sh
Written by Brian Bushnell
Last modified December 7, 2017
Description: Generates basic assembly statistics such as scaffold count,
N50, L50, GC content, gap percent, etc. For multiple files, please use
statswrapper.sh. Works with fasta and fastq only (gzipped is fine).
Please read bbmap/docs/guides/StatsGuide.txt for more information.
Usage: stats.sh in=<file>
Parameters:
in=file Specify the input fasta file, or stdin.
gc=file Writes ACGTN content per scaffold to a file.
gchist=file Filename to output scaffold gc content histogram.
shist=file Filename to output cumulative scaffold length histogram.
gcbins=200 Number of bins for gc histogram.
n=10 Number of contiguous Ns to signify a break between contigs.
k=13 Estimate memory usage of BBMap with this kmer length.
minscaf=0 Ignore scaffolds shorter than this.
phs=f (printheaderstats) Set to true to print total size of headers.
n90=t (printn90) Print the N/L90 metrics.
extended=f Print additional metrics such as N/L90 and log sum.
logoffset=1000 Minimum length for calculating log sum.
logbase=2 Log base for calculating log sum.
pdl=f (printduplicatelines) Set to true to print lines in the
scaffold size table where the counts did not change.
n_=t This flag will prefix the terms 'contigs' and 'scaffolds'
with 'n_' in formats 3-6.
addname=f Adds a column for input file name, for formats 3-6.
format=<0-7> Format of the stats information; default 1.
format=0 prints no assembly stats.
format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
format=4 is like 3 but with scaffold data only.
format=5 is like 3 but with contig data only.
format=6 is like 3 but the header starts with a #.
format=7 is like 1 but only prints contig info.
gcformat=<0-4> Select GC output format; default 1.
gcformat=0: (no base content info printed)
gcformat=1: name length A C G T N GC
gcformat=2: name GC
gcformat=4: name length GC
Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
I'll follow up what a few others have mentioned, but I like stats.sh within the BBTools package for raw assembly stats. I'll then polish my assembly and pare it down to the targeted contigs (eg. those annotated as bacteria/viruses/your genome of interest/etc.) then I like to use quast/metaquast to evaluate those contigs/bins/etc. If you have a mock control or reference sequence for evaluation, you can give that to quast and it will show your contig alignment, assembly %, and several other options, as well as let you compare across assemblies, which is invaluable for evaluating your assembly quality.
examples:
stats.sh in=contigs.fasta out=assembly_stats.txt gchist=gchist.txt shist=shist.txt score=t
A C G T N IUPAC Other GC GC_stdev
0.2576 0.2355 0.2359 0.2709 0.0000 0.0000 0.0000 0.4714 0.1067
Main genome scaffold total: 14645
Main genome contig total: 14645
Main genome scaffold sequence total: 9.225 MB
Main genome contig sequence total: 9.225 MB 0.000% gap
Main genome scaffold N/L50: 2725/901
Main genome contig N/L50: 2725/901
Main genome scaffold N/L90: 9896/303
Main genome contig N/L90: 9896/303
Main genome assembly score: 8.662
Max scaffold length: 83.033 KB
Max contig length: 83.033 KB
Number of scaffolds > 50 KB: 2
% main genome in scaffolds > 50 KB: 1.62%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 14,645 14,645 9,224,646 9,224,646 100.00%
100 14,645 14,645 9,224,646 9,224,646 100.00%
250 11,512 11,512 8,766,752 8,766,752 100.00%
500 5,983 5,983 6,788,048 6,788,048 100.00%
1 KB 2,336 2,336 4,246,150 4,246,150 100.00%
2.5 KB 282 282 1,268,180 1,268,180 100.00%
5 KB 39 39 467,306 467,306 100.00%
10 KB 9 9 277,465 277,465 100.00%
25 KB 3 3 185,822 185,822 100.00%
50 KB 2 2 149,603 149,603 100.00%
metaquast.py -o metaquast_sc_k121_norm_eval -r /Users/bbrow6/Desktop/InFANT/Viral_metagenome/assembly_eval/ATCC_MSA_viral_genomes/complete_fastas --gene-finding --min-identity 95 --threads 6 sc_k121_norm__OTUs_gt1000bps.fna
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
You just have to google it and you will find the number of options out there.
Like I did and found quast.
Can anyone suggest me which tool can be used to find the total gap percentage in assembled genome file?