How to download all the complete genomes for mycobacteria from NCBI?

7.5 years ago

Paul ▴ 80

How to download all the complete genomes for mycobacteria from NCBI?

I tried downloading the complete genomes from the NCBI site

ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/

But I couldn't find the FASTA files for the individual mycobacteria there. A search at https://www.ncbi.nlm.nih.gov/genome/?term=mycobacteria gave me 421 hits.

genome NCBI sequence • 4.2k views

ADD COMMENT • link updated 3.5 years ago by Debut ▴ 20 • written 7.5 years ago by Paul ▴ 80

7.5 years ago

5heikki 11k

#Get GenBank assembly summary file
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt

#Keep the lines that contain "Mycobacter"; where the 12th field is "Complete Genome", print the 20th field (the FTP path of the assembly directory).
#The sequence file inside that directory is named <directory name>_genomic.fna.gz, so append that to the path.
grep Mycobacter assembly_summary_genbank.txt \
    | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' \
    | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}' \
    > urls.txt

#Now you can go through your urls file
IFS=$'\n'; for NEXT in $(cat urls.txt); do wget "$NEXT"; done
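
The second awk step above is the one that turns a directory path into a file URL: it splits the path on "/", then prints the whole path followed by "/" and the last path component with "_genomic.fna.gz" appended. A minimal sketch on a single example path is given below; the variable names (path, url, url_alt) are only illustrative, and the plain-bash variant is an alternative assumed to be equivalent, not part of the original answer.

```shell
# One example assembly directory path (20th field of the summary file).
path="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3"

# Same awk program as in the pipeline: $0 is the full path, $NF is the last
# "/"-separated component, and OFS="/" joins the two with a slash.
url=$(printf '%s\n' "$path" | awk 'BEGIN{OFS=FS="/"}{print $0,$NF"_genomic.fna.gz"}')
echo "$url"

# Alternative without awk: ${path##*/} strips everything up to the last "/".
url_alt="$path/${path##*/}_genomic.fna.gz"
echo "$url_alt"
```

Both commands print the directory path followed by the directory name plus the "_genomic.fna.gz" suffix.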

ADD COMMENT • link 7.5 years ago by 5heikki 11k

Thanks.. This worked

ADD REPLY • link 7.5 years ago by Paul ▴ 80

I tried your method, but my urls.txt file is empty. Has the format changed?

ADD REPLY • link 3.6 years ago by Debut ▴ 20

It hasn't changed. I just tried the above and see 2,481 Mycobacter genomes with the status "Complete Genome".

ADD REPLY • link 3.6 years ago by 5heikki 11k

Okay, thank you for your answer.

ADD REPLY • link 3.6 years ago by Debut ▴ 20

$ grep klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
0
$ grep Klebsiella assembly_summary_genbank.txt | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l
1523
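
The two counts differ only because grep is case-sensitive by default: "klebsiella" matches nothing, while "Klebsiella" matches the organism names. A minimal sketch of a case-insensitive search with grep -i follows. It runs against a tiny made-up stand-in for the summary file, so the rows, the example.invalid paths and the helper make_row are all invented for illustration; only the use of fields 12 and 20 follows the commands above.

```shell
# Print one tab-separated row with 20 fields; the organism name goes into an
# arbitrary field, the assembly level into field 12 and the path into field 20.
make_row() {
    printf 'x\tx\tx\tx\tx\tx\tx\t%s\tx\tx\tx\t%s\tx\tx\tx\tx\tx\tx\tx\t%s\n' "$1" "$2" "$3"
}

sample=$(mktemp)
{
    make_row "Klebsiella pneumoniae" "Complete Genome" "ftp://example.invalid/asm1"
    make_row "Klebsiella oxytoca" "Scaffold" "ftp://example.invalid/asm2"
    make_row "Escherichia coli" "Complete Genome" "ftp://example.invalid/asm3"
} > "$sample"

# Case-sensitive search: lowercase "klebsiella" does not match "Klebsiella".
count_sensitive=$(grep klebsiella "$sample" \
    | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l | tr -d ' ')

# -i tells grep to ignore case, so the same search term now matches.
count_insensitive=$(grep -i klebsiella "$sample" \
    | awk 'BEGIN{FS="\t"}{if($12=="Complete Genome"){print $20}}' | wc -l | tr -d ' ')

echo "case-sensitive: $count_sensitive, case-insensitive: $count_insensitive"
rm -f "$sample"
```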

ADD REPLY • link 3.6 years ago by 5heikki 11k

Is it possible to put all the downloaded sequences into one file (a single multi-FASTA file), please?

ADD REPLY • link 3.5 years ago by Debut ▴ 20

$ ls
file1.fna  file2.fna
$ cat file1.fna
>seq1
aaaaaaaaaa
$ cat file2.fna
>seq2
gggggg
$ cat file1.fna file2.fna > file3.fna
$ cat file3.fna
>seq1
aaaaaaaaaa
>seq2
gggggg

ADD REPLY • link 3.5 years ago by 5heikki 11k

Thank you very much for your answer. But I have 10668 output files. Is there a command I can add, for example after "IFS=$'\n'; for NEXT in $(cat urls.txt); do wget "$NEXT"; done"? I tried "IFS=$'\n'; for NEXT in $(cat urls.txt); do wget "$NEXT"; done >doc.txt", but it didn't work.

ADD REPLY • link 3.5 years ago by Debut ▴ 20

The output files all end in ".gz", right?

So zcat *.gz > all.fna

Use zcat instead of cat because the files are gzip-compressed.
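
Redirecting the output of the wget loop does not help, because wget saves each download to its own file instead of writing the sequence to standard output. The concatenation therefore has to happen after the downloads. A minimal sketch on two made-up files follows; it uses gzip -dc, which is assumed here to behave like zcat, and the file names and sequences are invented for illustration.

```shell
# Create a scratch directory with two small gzip-compressed FASTA files.
workdir=$(mktemp -d)
printf '>seq1\naaaaaaaaaa\n' | gzip > "$workdir/file1_genomic.fna.gz"
printf '>seq2\ngggggg\n' | gzip > "$workdir/file2_genomic.fna.gz"

# Decompress every .gz file to standard output and collect it in one FASTA file.
gzip -dc "$workdir"/*.gz > "$workdir/all.fna"

# Every FASTA record starts with ">", so this counts the merged sequences.
n_records=$(grep -c '^>' "$workdir/all.fna")
echo "records in all.fna: $n_records"

rm -rf "$workdir"
```

Counting the header lines is a quick way to check that nothing was lost. With a very large number of files the *.gz glob could in principle exceed the shell's argument-length limit; find . -name '*.gz' -exec gzip -dc {} + > all.fna is one possible alternative in that case.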

ADD REPLY • link 3.5 years ago by 5heikki 11k

Hi, I'm trying to do this with Python. I've already loaded the table with pandas and extracted the FTP path, and I'd like to do the same thing. I need to go from "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3" to "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/316/945/GCA_001316945.3_ASM131694v3/GCA_001316945.3_ASM131694v3_genomic.fna.gz". Thanks.

ADD REPLY • link 3.5 years ago by Debut ▴ 20
