Dear all,
I am trying to retrieve all complete Salmonella genomes from the NCBI nucleotide DB (without redundancy).
This is my current attempt:
esearch -db nucleotide \
-query "salmonella[organism] \
AND complete genome[Title] \
NOT contig[Title]" \
| efetch -format fasta \
> seqs.fasta;
This yields 828 sequences (as of 2018-09-21). 351 of them are "duplicated" sequences, i.e. NZ_XXXXX records: each one is derived from the exact same sequence in the NCBI nucleotide DB, except that the other record is named XXXXX.
Example :
NZ_CP030203.1 Salmonella enterica strain SA20083530 chromosome, complete genome
CP030203.1 Salmonella enterica strain SA20083530 chromosome, complete genome
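For what it's worth, such pairs can be counted directly in the downloaded file. This is only a minimal sketch: it assumes that the accession is the first token of each FASTA header and that the duplicate differs only by the NZ_ prefix, as in the example above. The function name count_nz_pairs is made up for illustration.

```shell
# Count "NZ_" records whose un-prefixed counterpart is also in the FASTA file.
# Assumption: headers look like ">NZ_CP030203.1 description" (accession first).
count_nz_pairs() {
  grep '^>' "$1" \
    | awk '{ sub(/^>/, "", $1); seen[$1] = 1 }   # remember every accession
           END {
             n = 0
             for (acc in seen)
               # NZ_CP030203.1 counts only if CP030203.1 was seen as well
               if (acc ~ /^NZ_/ && (substr(acc, 4) in seen)) n++
             print n
           }'
}
```

Usage: count_nz_pairs seqs.fasta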
So I tried to filter them out with esearch terms:
esearch -db nucleotide \
-query "salmonella[organism] \
AND complete genome[Title] \
NOT contig[Title] \
NOT NZ_CP022017.1[ACCN]";
<Count>827</Count>
This worked: I can exclude a given NZ_XXXXX sequence.
But I fail to do the same for all NZ sequences at once.
Attempt based on the NCBI book (ctrl+f "Taxonomy Search"):
PS: Search fields reminder!
esearch -db nucleotide \
-query "salmonella[organism] \
AND complete genome[Title] \
NOT contig[Title] \
NOT NZ_0:NZ_999999999[ACCN]";
<Count>828</Count>
Another attempt, based on this (ctrl+f "How can I identify genes with/without a known function"):
esearch -db nucleotide \
-query "salmonella[organism] \
AND complete genome[Title] \
NOT contig[Title] \
NOT NZ_*[ACCN]";
<Count>828</Count>
Any tips? (I could filter the sequences after the download, but I'd rather avoid this. Maybe some xtract magic?)
OK, based on the NZ_XX[0-9]* sequence IDs I get when downloading the unfiltered database, I came to this "dirty" solution:
I am still not thrilled, because I am under the impression that I am not grasping everything here.
I'll stop chatting with myself after this comment, but here is my final "dirty" solution:
After checking for additional "duplicates", I found that NC_XXXXX sequences are also derived from an identical sequence entry in the DB (and thus need to be removed).
I also found that "complete chromosome" is an acceptable name for complete genomes in the nucleotide DB (given that we are dealing with a bacterium here).
Lastly, sequence titles containing both "complete genome" and "plasmid" all refer to the plasmid sequence (judging by their size), so they need to be removed.
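The two exclusion rules above can also be applied to an already downloaded FASTA file. The following is a minimal sketch and not the filter actually used in this thread: it assumes the accession is the first token of each header, and the function name filter_genomes is hypothetical.

```shell
# Drop RefSeq copies (NZ_/NC_ accessions) and plasmid records from FASTA input.
# Assumption: the accession is the first token of each header line.
filter_genomes() {
  awk '
    /^>/ {
      acc = substr($1, 2)   # accession without the leading ">"
      # keep the record unless it is a RefSeq copy or its title mentions a plasmid
      keep = !(acc ~ /^(NZ|NC)_/ || tolower($0) ~ /plasmid/)
    }
    keep { print }          # print header and sequence lines of kept records
  ' "$@"
}
```

Usage: filter_genomes seqs.fasta > seqs_filtered.fasta (the output file name is just an example).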
This excludes all accessions starting with NZ*. Use
efilter -source
to pick sequences from a given source (genbank, insd, pdb, pir, refseq, swissprot, tpa).
Any idea why "AM933172" is not returned by your command (despite being a GenBank accession)?
I think you might have to use
-source insd
to ensure that all entries are listed. This results in 496 entries.
Ok, this "efilter -source insd" is working well !
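For reference, the whole pipeline might then look like the sketch below. It simply combines the query from the original question with efilter -source insd. The wrapper function fetch_insd_genomes and the default output file name are assumptions made for illustration, and running it needs EDirect plus network access.

```shell
# Sketch: keep only INSDC (GenBank/ENA/DDBJ) records, leaving out RefSeq copies.
# Requires NCBI EDirect (esearch, efilter, efetch) and network access.
fetch_insd_genomes() {
  out=${1:-seqs_insd.fasta}   # hypothetical default output file name
  esearch -db nucleotide \
    -query "salmonella[organism] AND complete genome[Title] NOT contig[Title]" \
    | efilter -source insd \
    | efetch -format fasta \
    > "$out"
}
```

Usage: fetch_insd_genomes seqs_insd.fasta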
If I compare my "dirty" solution with yours :
Now, what remains to be solved is my definition of "complete genome". I also want to capture all "chromosome level" assemblies found here (because they are full-length genomes as far as I know).
I hope you noticed that the taxonomy ID for Salmonella is
txid590
and that the link you provided here is for Salmonella enterica (txid28901).
Yes, I am using the right taxid with the cmd below, which returns
590
Regarding the link for Salmonella enterica: it is the main species within the genus Salmonella (to my knowledge the other two species are bongori and subterranea), so it seems OK to browse the NCBI genome DB for Salmonella enterica to gain some insight.
Anyway, shouldn't txid590 encompass txid28901?