Question

A strategy to get bacterial NCBI RefSeqs, excluding contigs, in one file

10.0 years ago

akh22 ▴ 120

I've been trying to generate a single file containing all the bacterial RefSeq sequences from NCBI. I followed a similar previous discussion, Ncbi Refseq Viral Genomes, and slightly modified the Perl script posted there by bw, changing:

$organism = 'viruses' to $organism= 'bacteria'

This script picked up over 2 million sequences, and I then realized that it was picking up complete genomic assemblies as well as every contig from shotgun sequencing. I would appreciate any help modifying bw's script to exclude contigs, or any alternative method to accomplish this.

Thanks.

sequence RNA-Seq • 2.8k views

updated 3.8 years ago by Ram 44k • written 10.0 years ago by akh22 ▴ 120

Ram · Answer 1 · 2015-01-11

I've been trying to generate a single file containing all the bacterial RefSeq

I would not recommend storing all bacterial full-length genomic sequences in a single FASTA file. You will not be able to handle such a huge file efficiently in practice. Any large data collection needs an index, and the easiest way to create one is to exploit the file system: create a directory and store each sequence in a separate file. The filenames in that directory then serve as the index.
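The one-file-per-sequence layout could be sketched in Python roughly as follows. This is only an illustration, not part of bw's script: the function name `split_fasta`, the `.fasta` suffix, and the demo accessions are hypothetical, and it assumes the first token of each FASTA header is a unique, filesystem-safe identifier.

```python
import os
import tempfile


def split_fasta(lines, out_dir):
    """Write each FASTA record to its own file; return the IDs written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    handle = None
    try:
        for line in lines:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                tokens = line[1:].split()
                if not tokens:
                    raise ValueError("FASTA header without an identifier")
                # Assumption: the first header token is unique and is a
                # valid filename (true for typical RefSeq accessions).
                seq_id = tokens[0]
                handle = open(os.path.join(out_dir, seq_id + ".fasta"), "w")
                written.append(seq_id)
            if handle is not None:
                # Normalise line endings so every record ends with a newline.
                handle.write(line if line.endswith("\n") else line + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written


if __name__ == "__main__":
    # Hypothetical two-record input, just to show the resulting layout.
    demo = [">NC_000001.1 hypothetical record one\n", "ACGT\n",
            ">NC_000002.1 hypothetical record two\n", "GGCC\n", "TTAA\n"]
    with tempfile.TemporaryDirectory() as tmp:
        print(split_fasta(demo, tmp))
        print(sorted(os.listdir(tmp)))
```

Looking up a sequence is then just opening `<accession>.fasta`. For a real download you would pass an open file handle as `lines`, so the input is streamed rather than held in memory.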

$organism='viruses' to $organism='bacteria'

You have to find an appropriate query term for Eutils that returns only the sequences you are interested in.

'Bacteria[Organism]' will restrict search to eubacterial sequences
'complete[Properties]' will restrict search to sequences tagged as complete (including WGS)
'WGS[Properties]' will restrict search to contigs from WGS genomes
'srcdb_refseq[prop]' will restrict search to sequences which have been promoted into the non-redundant NCBI RefSeq database

Thus you may use the query "Bacteria[Organism] AND complete[Properties] NOT WGS[Properties] AND srcdb_refseq[prop]". You can try it on the command line:

wget -O - 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&rettype=count&term=Bacteria[Organism]+AND+complete[Properties]+NOT+WGS[Properties]+AND+srcdb_refseq[prop]'

<eSearchResult>
        <Count>10631</Count>
</eSearchResult>
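The same count query can be issued from a script. The following is a minimal Python sketch using only the standard library. The helper names (`build_esearch_url`, `parse_count`, `fetch_count`) are made up for illustration, and it assumes the `https` form of the endpoint and a response shaped like the XML shown above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: same esearch endpoint as the wget call, but over https.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
TERM = ("Bacteria[Organism] AND complete[Properties] "
        "NOT WGS[Properties] AND srcdb_refseq[prop]")


def build_esearch_url(term, db="nuccore"):
    """Return an esearch URL that asks only for the hit count."""
    params = {"db": db, "rettype": "count", "term": term}
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(xml_text):
    """Extract the integer in <Count> from an eSearchResult document."""
    count = ET.fromstring(xml_text).findtext("Count")
    if count is None:
        raise ValueError("no <Count> element in esearch response")
    return int(count)


def fetch_count(term):
    """Query NCBI (network access required) and return the hit count."""
    with urllib.request.urlopen(build_esearch_url(term)) as resp:
        return parse_count(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(build_esearch_url(TERM))
    # Parsing the response quoted in the answer, without touching the network:
    print(parse_count("<eSearchResult><Count>10631</Count></eSearchResult>"))
```

Calling `fetch_count(TERM)` would run the live query. The number returned today will differ from the 10631 shown above, since RefSeq has grown since this answer was written.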