Retrieving assembly statistics for several accessions
6.3 years ago

Hey everyone,

I would like to download the global statistics for several assemblies. For example, at https://www.ncbi.nlm.nih.gov/assembly/GCF_000159415.1/ you can see the Global Statistics table (sequence length, N50, number of contigs, etc.). I have a list of RefSeq accessions, like the one below:

GCF_000014425.1 GCF_000056065.1 GCF_000155915.2 GCF_000159335.1 GCF_000159355.1 GCF_000159415.1 GCF_000160855.1 GCF_000160875.1

Is it possible to extract the Global Statistics for several RefSeq accessions at once, perhaps using Entrez Direct? Cheers

Assembly sequence genome • 945 views
6.3 years ago

Using an XSLT stylesheet with xsltproc (saved here as biostars365290.xsl):

<?xml version='1.0' encoding="UTF-8" ?>
<xsl:stylesheet
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:h='http://www.w3.org/1999/xhtml'
    version='1.0'>
<xsl:output method="text"/>

<!-- select every row of the "Global statistics" table -->
<xsl:template match="/">
  <xsl:apply-templates select="//h:table[@summary='Global statistics']/h:tbody/h:tr"/>
</xsl:template>

<!-- print assembly name (taken from the page title), statistic label and value, tab-separated -->
<xsl:template match="h:tr">
  <xsl:value-of select="substring-before(/h:html/h:head/h:title/text(),'- Genome')"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="h:td[1]/text()"/>
  <xsl:text>&#9;</xsl:text>
  <xsl:value-of select="h:td[2]/text()"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>
</xsl:stylesheet>
$ echo GCF_000014425.1 GCF_000056065.1 GCF_000155915.2 GCF_000159335.1 GCF_000159355.1 GCF_000159415.1 GCF_000160855.1 GCF_000160875.1 | tr " " "\n" |\
while read X; do wget -q -O -  "https://www.ncbi.nlm.nih.gov/assembly/${X}/"| xsltproc --novalid biostars365290.xsl - ; done

Output

ASM1442v1 	Total sequence length	1,894,360
ASM1442v1 	Total ungapped length	1,894,360
ASM1442v1 	Total number of chromosomes and plasmids	1
ASM5606v1 	Total sequence length	1,864,998
ASM5606v1 	Total ungapped length	1,864,998
ASM5606v1 	Total number of chromosomes and plasmids	1
Lacto_jensenii_1153_V2 	Total sequence length	1,746,219
Lacto_jensenii_1153_V2 	Total ungapped length	1,737,886
Lacto_jensenii_1153_V2 	Gaps between scaffolds	0
Lacto_jensenii_1153_V2 	Number of scaffolds	1
Lacto_jensenii_1153_V2 	Scaffold N50	1,746,219
Lacto_jensenii_1153_V2 	Scaffold L50	1
Lacto_jensenii_1153_V2 	Number of contigs	9
Lacto_jensenii_1153_V2 	Contig N50	903,202
Lacto_jensenii_1153_V2 	Contig L50	1
Lacto_jensenii_1153_V2 	Total number of chromosomes and plasmids	0
Lacto_jensenii_1153_V2 	Number of component sequences (WGS or clone)	9
ASM15933v1 	Total sequence length	1,604,632
ASM15933v1 	Total ungapped length	1,600,834
ASM15933v1 	Gaps between scaffolds	0
ASM15933v1 	Number of scaffolds	1
ASM15933v1 	Scaffold N50	1,604,632
ASM15933v1 	Scaffold L50	1
ASM15933v1 	Number of contigs	2
ASM15933v1 	Contig N50	832,947
ASM15933v1 	Contig L50	1
ASM15933v1 	Total number of chromosomes and plasmids	1
ASM15933v1 	Number of component sequences (WGS or clone)	2
ASM15935v1 	Total sequence length	1,780,499
ASM15935v1 	Total ungapped length	1,772,891
ASM15935v1 	Gaps between scaffolds	0
ASM15935v1 	Number of scaffolds	32
ASM15935v1 	Scaffold N50	1,030,662
ASM15935v1 	Scaffold L50	1
ASM15935v1 	Number of contigs	52
ASM15935v1 	Contig N50	139,200
ASM15935v1 	Contig L50	5
ASM15935v1 	Total number of chromosomes and plasmids	0
ASM15935v1 	Number of component sequences (WGS or clone)	52
ASM15941v1 	Total sequence length	2,248,406
ASM15941v1 	Total ungapped length	2,168,059
ASM15941v1 	Gaps between scaffolds	0
ASM15941v1 	Number of scaffolds	48
ASM15941v1 	Scaffold N50	475,421
ASM15941v1 	Scaffold L50	2
ASM15941v1 	Number of contigs	116
ASM15941v1 	Contig N50	44,004
ASM15941v1 	Contig L50	15
ASM15941v1 	Total number of chromosomes and plasmids	0
ASM15941v1 	Number of component sequences (WGS or clone)	116
ASM16085v1 	Total sequence length	2,020,582
ASM16085v1 	Total ungapped length	1,808,667
ASM16085v1 	Gaps between scaffolds	0
ASM16085v1 	Number of scaffolds	49
ASM16085v1 	Scaffold N50	288,743
ASM16085v1 	Scaffold L50	2
ASM16085v1 	Number of contigs	235
ASM16085v1 	Contig N50	15,586
ASM16085v1 	Contig L50	38
ASM16085v1 	Total number of chromosomes and plasmids	0
ASM16085v1 	Number of component sequences (WGS or clone)	235
ASM16087v1 	Total sequence length	1,277,649
ASM16087v1 	Total ungapped length	1,269,321
ASM16087v1 	Gaps between scaffolds	0
ASM16087v1 	Number of scaffolds	12
ASM16087v1 	Scaffold N50	246,521
ASM16087v1 	Scaffold L50	2
ASM16087v1 	Number of contigs	21
ASM16087v1 	Contig N50	246,521
ASM16087v1 	Contig L50	2
ASM16087v1 	Total number of chromosomes and plasmids	0
ASM16087v1 	Number of component sequences (WGS or clone)	22


Thanks man! This is very helpful. I would like to show this data as a table: for example, the first column with the description, the second with the sequence length, the third with the number of contigs, etc. Could you please let me know how to do that? Thanks!
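
One possible way to get such a table is to pivot the three-column output (assembly, statistic, value) into one row per assembly. The snippet below is only a minimal sketch in Python, not part of the original answer; the function names `pivot_stats` and `write_wide` are made up for illustration, and it assumes the input lines are tab-separated exactly as in the output shown above. Statistics that a given assembly lacks are filled with `NA`.

```python
import csv
import sys


def pivot_stats(lines):
    """Group 'assembly<TAB>statistic<TAB>value' lines by assembly.

    Returns (table, columns): table maps assembly name -> {statistic: value},
    columns lists the statistic names in order of first appearance.
    """
    table = {}
    columns = []
    for line in lines:
        parts = [p.strip() for p in line.rstrip("\n").split("\t")]
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, stat, value = parts
        table.setdefault(name, {})[stat] = value
        if stat not in columns:
            columns.append(stat)
    return table, columns


def write_wide(table, columns, out):
    """Write one tab-separated row per assembly, one column per statistic."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Assembly"] + columns)
    for name, stats in table.items():
        writer.writerow([name] + [stats.get(c, "NA") for c in columns])


if __name__ == "__main__":
    # A few lines copied from the output above, used as demo input.
    demo = [
        "ASM1442v1 \tTotal sequence length\t1,894,360",
        "ASM1442v1 \tTotal ungapped length\t1,894,360",
        "ASM15933v1 \tTotal sequence length\t1,604,632",
        "ASM15933v1 \tTotal ungapped length\t1,600,834",
        "ASM15933v1 \tNumber of contigs\t2",
    ]
    table, columns = pivot_stats(demo)
    write_wide(table, columns, sys.stdout)
```

To use it on the real data, redirect the output of the wget/xsltproc loop to a file (the name stats.tsv is just an example) and pass the open file to `pivot_stats` instead of the demo list.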
