Download genomes within a given GC content interval
1
0
Entering edit mode
5.5 years ago

Hey guys,

Does anyone know how to download only complete genomes within a given GC content range from NCBI? For example, all complete genomes with a GC content between 40% and 50%. Thank you!

Assembly genome sequence • 1.1k views
5.5 years ago
GenoMax 147k

NCBI publishes genome reports for various organisms.

Let us get the prokaryotic genome report, prokaryotes.txt.

If you parse this file, you can get the genomes whose GC% (column 8) is between 40 and 50, together with their FTP paths (column 21):

$ awk -F '\t' '$8 >= 40 && $8 <= 50 {print $1 "\t" $21}' prokaryotes.txt | head -5
Yersinia pestis CO92     ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/009/065/GCA_000009065.1_ASM906v1
Tropheryma whipplei str. Twist   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/485/GCA_000007485.1_ASM748v1
Actinobacillus pleuropneumoniae serovar 5b str. L20      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/015/885/GCA_000015885.1_ASM1588v1
Chlamydia pneumoniae CWL029      ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/745/GCA_000008745.1_ASM874v1
Vibrio vulnificus        ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/215/135/GCA_002215135.1_ASM221513v1

In each of those directories you can find a *.fna.gz file with the genome sequence.
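
The question asked for complete genomes only, and the GC filter above does not look at assembly status. Here is a rough sketch of how that could be added. It assumes the report has a Status column at position 16 with the value "Complete Genome" for finished assemblies. That column number and value are my assumptions and are not confirmed anywhere in this thread, so check the header first. `gc_complete` is just a hypothetical helper name.

```shell
# Hypothetical helper: print name and FTP path for rows whose GC% is within
# [MIN, MAX] and whose status matches. Columns 8 (GC%) and 21 (FTP path) are
# the ones used above; column 16 (Status) is an ASSUMPTION, verify it with:
#   head -1 prokaryotes.txt | tr '\t' '\n' | cat -n
gc_complete() {
    # usage: gc_complete MIN MAX REPORT_FILE [STATUS]
    awk -F '\t' -v OFS='\t' -v lo="$1" -v hi="$2" \
        -v status="${4:-Complete Genome}" '
        /^#/                          { next }  # skip a header line starting with "#"
        $8 !~ /^[0-9]+(\.[0-9]+)?$/   { next }  # skip missing or non-numeric GC%
        $16 != status                 { next }  # assumed Status column
        $8 + 0 >= lo + 0 && $8 + 0 <= hi + 0 { print $1, $21 }
    ' "$3"
}

# Example (not run here):
#   gc_complete 40 50 prokaryotes.txt | head -5
```

If the status values in your copy of the report are spelled differently, pass the exact string as the fourth argument.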

This variation of the GC filter should get you all the way to downloadable URLs:

$ awk -F '\t' '$8 >= 40 && $8 <= 50 {n = split($21, part, "/"); print $21 "/" part[n] "_genomic.fna.gz"}' prokaryotes.txt | head -5
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/009/065/GCA_000009065.1_ASM906v1/GCA_000009065.1_ASM906v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/007/485/GCA_000007485.1_ASM748v1/GCA_000007485.1_ASM748v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/015/885/GCA_000015885.1_ASM1588v1/GCA_000015885.1_ASM1588v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/745/GCA_000008745.1_ASM874v1/GCA_000008745.1_ASM874v1_genomic.fna.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/215/135/GCA_002215135.1_ASM221513v1/GCA_002215135.1_ASM221513v1_genomic.fna.gz
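
To actually fetch the files, one option is to write the URLs to a list and hand that list to a downloader. The sketch below wraps the URL-building step in a small function and shows a possible wget call in the comments. `genome_urls` and `urls.txt` are hypothetical names I made up for illustration, and the "-" check assumes that missing FTP paths are written as a dash in the report.

```shell
# Hypothetical helper: read assembly directory URLs on stdin (one per line)
# and print the URL of the *_genomic.fna.gz file inside each directory.
genome_urls() {
    awk -F '/' '
        NF < 2 { next }                    # skip blank lines and "-" placeholders
        {
            sub(/\/+$/, "")                # drop any trailing slash
            n = split($0, part, "/")       # last component is the directory name
            print $0 "/" part[n] "_genomic.fna.gz"
        }'
}

# Example (not run here): build the list, then let wget fetch it.
#   awk -F '\t' '$8 >= 40 && $8 <= 50 {print $21}' prokaryotes.txt | genome_urls > urls.txt
#   wget -nc -i urls.txt    # -i reads URLs from a file, -nc skips files already present
```

Drop the `| head -5` from the earlier commands when you want the full list, and have a look at how many lines end up in urls.txt before starting the download, since the GC range may match a large number of assemblies.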

Thanks, really appreciate that!
