edirect esearch : exclude given accessions
2
2
Entering edit mode
6.2 years ago
erwan.scaon ▴ 950

Dear all,

I am trying to retrieve all Salmonella complete genomes from NCBI nucleotide DB (without redundancy).

This is my current try :

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND complete genome[Title] \
                NOT contig[Title]" \
        | efetch -format fasta \
        > seqs.fasta;

This yield 828 seqs (date : 2018_09_21), 351 of them are "duplicated" sequences, i.e. NZ_XXXXX sequences, because they are all derived from the same exact sequence in NCBI nucleotide DB, except this other sequence is named XXXXX.
Example :

NZ_CP030203.1 Salmonella enterica strain SA20083530 chromosome, complete genome
CP030203.1 Salmonella enterica strain SA20083530 chromosome, complete genome

Thus, I wanted to try to filter them out with esearch terms :

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND complete genome[Title] \
                NOT contig[Title] \
                NOT NZ_CP022017.1[ACCN]";

Count_827_Count

This worked, I can exclude a given NZ_XXXXX sequence.
But I fail to accomplish this for all NZ sequences :

Try based on NCBI book (ctrl+f "Taxonomy Search") :
Ps : Search fields reminder !

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND complete genome[Title] \
                NOT contig[Title] \
                NOT NZ_0:NZ_999999999[ACCN]";

Count_828_Count

Another try based on this (ctrl+f "How can I identify genes with/without a known function") :

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND complete genome[Title] \
                NOT contig[Title] \
                NOT NZ_*[ACCN]";

Count_828_Count

Any tips ? (I could filter sequences after the download, but I'd rather avoid this. Maybe some xtract magic ?)

edirect esearch ncbi • 3.3k views
ADD COMMENT
0
Entering edit mode

Ok, based on NZ_XX[0-9]* sequences IDs I have when downloading the non-filtered database, I came to this "dirty" solution :

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND complete genome[Title] \
                NOT contig[Title] \
                NOT NZ_CP0:NZ_CP999999999[ACCN] \
                NOT NZ_AP0:NZ_AP999999999[ACCN] \
                NOT NZ_LN0:NZ_LN999999999[ACCN]";

Count_477_Count

Still not thrilled because I am under the impression that I am not grasping everything here.

ADD REPLY
0
Entering edit mode

I'll stop chatting with myself after this comment, but here is my final "dirty" solution :

After checking for additional "duplicates", I found that NC_XXXXX sequences are derived from an identical sequence entry in the DB (thus need to be removed).
Also found that "complete chromosome" is an acceptable appellation for complete genomes in nucleotide DB (given that we are dealing with a bacteria here).
Lastly, sequences titles containing both "complete genome" and "plasmid" all refer to the plasmid sequence (given their size), they need to be removed.

esearch -db nucleotide \
        -query "salmonella[organism] \
                AND (complete genome[Title] OR complete chromosome[Title]) \
                NOT (contig[Title] OR plasmid[Title]) \
                NOT (NZ_CP0:NZ_CP999999999[ACCN] OR NZ_AP0:NZ_AP999999999[ACCN] OR NZ_LN0:NZ_LN999999999[ACCN]) \
                NOT NC_0:NC_999999999[ACCN]";

Count_420_Count

ADD REPLY
0
Entering edit mode
esearch -db nucleotide -query "txid590 [organism] AND complete genome [TITLE]"|efilter -source genbank|efetch -format acc

This excludes all accession with NZ*, use efilter -source to pick sequences from -source genbank, insd, pdb, pir, refseq, swissprot, tpa

ADD REPLY
0
Entering edit mode

Any idea why this one "AM933172" is not returned with your command (despite being a GenBank accession) ?

ADD REPLY
0
Entering edit mode

I think you might have to use -source insd to ensure that all entries are listed.

esearch -db nucleotide -query "txid590 [ORGANISM] AND complete genome [TITLE]"|efilter -source insd|efetch -format acc

results in 496 entries.

ADD REPLY
0
Entering edit mode

Ok, this "efilter -source insd" is working well !

If I compare my "dirty" solution with yours :

taxid=$(esearch -db taxonomy \
                -query 'Salmonella' \
                | efetch -format xml \
                | xtract -pattern Taxon \
                         -element TaxId);

esearch -db nucleotide \
        -query "txid$taxid [Organism] \
                AND (complete genome[Title] OR complete chromosome[Title]) \
                NOT (contig[Title] OR plasmid[Title]) \
                NOT (NZ_CP0:NZ_CP999999999[Accession] OR NZ_AP0:NZ_AP999999999[Accession] OR NZ_LN0:NZ_LN999999999[Accession]) \
                NOT NC_0:NC_999999999[Accession]" \
        | efetch -format acc \
        | sort \
        > dirty.list;

esearch -db nucleotide \
        -query "txid$taxid [Organism] \
                AND (complete genome[Title] OR complete chromosome[Title]) \
                NOT (contig[Title] OR plasmid[Title])" \
        | efilter -source insd \
        | efetch -format acc \
        | sort \
        > sej.list;

comm dirty.list sej.list > comm.tsv;
printf "unique to dirty :\t"; awk -F "\t" '{if($1 != "") print $1}' comm.tsv | wc -l;
printf "unique to sej :\t"; awk -F "\t" '{if($2 != "") print $2}' comm.tsv | wc -l;
printf "common to both :\t"; awk -F "\t" '{if($3 != "") print $3}' comm.tsv | wc -l;

unique to dirty : 0
unique to sej : 0
common to both : 421

Now, what remain to be solved is my definition of "complete genome". I also want to capture all "chromosome level" assemblies found here (because they are full length genomes as far as I know).

ADD REPLY
0
Entering edit mode

I hope you noticed that taxonomy ID for Salmonella is txid590 and the link you have provided here is for Salmonella enterica (txid28901)

ADD REPLY
0
Entering edit mode

Yes, I am using the right taxid with the cmd below, which is returning 590

taxid=$(esearch -db taxonomy \
                -query 'Salmonella' \
                | efetch -format xml \
                | xtract -pattern Taxon \
                         -element TaxId);

Regarding the link for Salmonella enterica, it's the main species within genus Salmonella (the other two species to my knowledge are bongori & subterranea), thus it seems OK to navigate the NCBI genome DB for Salmonella enterica to gain some insight.

Anyway, shouldn't txid590 encompass txid28901 ?

ADD REPLY
0
Entering edit mode
6.2 years ago
GenoMax 147k

Get this prokaryotic genome reports file.

$ grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" |  awk 'BEGIN{FS="\t"}{print $21}' | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}' | sh

As of now 423 genomes.

ADD COMMENT
0
Entering edit mode

This returned 9 FASTA files for me, for a total of 15 sequences.

My prokaryotes.txt file as 36129 lines.

Did I miss something ?

Edit : First download of "prokaryotes.txt" did break somehow, thus the weird results.

ADD REPLY
0
Entering edit mode

If you remove last |sh it should give you all wget commands. I am getting 423 lines which should each be a separate genome.

ADD REPLY
0
Entering edit mode
axel -q ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt;
grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" |  awk 'BEGIN{FS="\t"}{print $21}' | awk 'BEGIN{OFS=FS="/"}{print "wget "$0,$NF"_genomic.fna.gz"}' | wc -l;

458 !

ADD REPLY
0
Entering edit mode

Are you on macOS? I am using GNU grep. Check to see with just the two grep that you are getting complete Salmonella listings.

ADD REPLY
0
Entering edit mode

Ubuntu 16.04
grep (GNU grep) 2.25

grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" | head -2

Salmonella enterica subsp. enterica serovar Typhi str. CT18 220341 PRJNA236 236 Proteobacteria Gammaproteobacteria 5.13371 51.8776 chromosome:NC_003198.1/AL513382.1; plasmid pHCM1:NC_003384.1/AL513383.1; plasmid pHCM2:NC_003385.1/AL513384.1 - 3 4829 4473 2001/11/07 2017/04/14 Complete Genome Sanger Institute SAMEA1705914 GCA_000195995.1 REFR ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/195/995/GCA_000195995.1_ASM19599v1 11677608 CT18 Salmonella bongori NCTC 12419 218493 PRJNA351 351 Proteobacteria Gammaproteobacteria 4.46011 51.3 chromosome:NC_015761.1/FR877557.1 - 1 4382 4068 2011/07/06 2017/05/28 Complete Genome Sanger Institute SAMEA3138409 GCA_000252995.1 - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/252/995/GCA_000252995.1_ASM25299v1 21876672 NCTC 12419

grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" | tail -2

Salmonella enterica subsp. salamae 59202 PRJEB6403 251923 Proteobacteria Gammaproteobacteria 4.88377 52 chromosome 1:NZ_LS483477.1/LS483477.1 - 1 4874 4540 2018/06/17 2018/06/25 Complete Genome SC SAMEA4028439 GCA_900478225.1 - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/478/225/GCA_900478225.1_47328_C01 - NCTC10310 Salmonella enterica subsp. enterica serovar Infantis 595 PRJEB6403 251923 Proteobacteria Gammaproteobacteria 4.63034 52.3 chromosome 1:NZ_LS483479.1/LS483479.1 - 1 4689 4410 2018/06/17 2018/06/25 Complete Genome SC SAMEA4028447 GCA_900478235.1 - ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/478/235/GCA_900478235.1_49950_B01 - NCTC6703

grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" | wc -l

458

ADD REPLY
0
Entering edit mode

Here is what I get. This is close to 420 you had posted in solution above.

$ grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" |  head -2
Salmonella enterica subsp. enterica serovar Typhi str. CT18     220341  PRJNA236        236     Proteobacteria  Gammaproteobacteria     5.13371 51.8776 chromosome:NC_003198.1/AL513382.1; plasmid pHCM1:NC_003384.1/AL513383.1; plasmid pHCM2:NC_003385.1/AL513384.1      -       3       4829    4473    2001/11/07      2017/04/14      Complete Genome    Sanger Institute        SAMEA1705914    GCA_000195995.1 REFR    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/195/995/GCA_000195995.1_ASM19599v1       11677608  CT18
Salmonella bongori NCTC 12419   218493  PRJNA351        351     Proteobacteria  Gammaproteobacteria     4.46011 51.3    chromosome:NC_015761.1/FR877557.1       -       1 4382     4068    2011/07/06      2017/05/28      Complete Genome Sanger Institute        SAMEA3138409    GCA_000252995.1 -       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/252/995/GCA_000252995.1_ASM25299v1  21876672        NCTC 12419

$ grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" |  tail -2
Salmonella enterica subsp. enterica serovar Tennessee   143221  PRJNA414243     414243  Proteobacteria  Gammaproteobacteria     4.86259 52.3    chromosome:CP024164.1   - 4908     4328    2017/11/02      2017/11/02      Complete Genome FDA     SAMN07795411    GCA_002741865.1 -       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/741/865/GCA_002741865.1_ASM274186v1 -       CFSAN070645
Salmonella enterica subsp. enterica serovar Senftenberg 28150   PRJNA423165     423165  Proteobacteria  Gammaproteobacteria     5.15105 52.1114 chromosome:CP026379.1; plasmid pN17-509:CP026380.1 -       2       5361    4786    2018/02/04      2018/02/09      Complete Genome University of Zurich    SAMN08369945    GCA_002953175.1 -       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/953/175/GCA_002953175.1_ASM295317v1 -       N17-509

$ grep -w "Complete" prokaryotes.txt | grep -w "Salmonella" |  wc -l
423
ADD REPLY
0
Entering edit mode

Did you download "prokaroytes.txt" today ?

Do you mind sharing a file with full "grep -w "Complete" prokaryotes.txt | grep -w "Salmonella"" output so that we can run "comm" ?

Ps : I am already quite confident that the previous solution yielding 420 sequences is not good enough, given that it's not taking into account assemblies considered at "the chromosome level".

ADD REPLY
0
Entering edit mode

Yes I did.

Edit: Removed the link to a gist to keep this thread short.

ADD REPLY
0
Entering edit mode
6.2 years ago
erwan.scaon ▴ 950

So, here we are :
I want to retrieve all full length genome sequences of genus Salmonella.

Get Salmonella taxid :

taxid=$(esearch -db taxonomy \
                -query 'Salmonella' \
                | efetch -format xml \
                | xtract -pattern Taxon \
                         -element TaxId);

Retrive all sequences "labeled complete" associated with this taxid in NCBI nucleotide DB (minus some filters, like restriciton to source = insd)

esearch -db nucleotide \
        -query "txid590 [Organism] \
                AND (complete genome[Title] OR complete chromosome[Title]) \
                NOT (contig[Title] OR plasmid[Title])" \
        | efilter -source insd \
        | efetch -format fasta \
        > salmo.fasta;

This FASTA file contains 421 sequences.

Now, If you browse the NCBI genome DB here, for species enterica and bongori, you can go to "Genome Assembly and Annotation report". In there, all entries with assembly level "complete" or "chromosome" are of interest to us. According to these tables, bongori has 4 complete genomes and enterica 480 complete and 163 chromosome (647 total). Associated tables can be downloaded.
I wanted to know what I missed by not using this ressource :

Create a file with accesions numbers for sequences downloaded previously (salmo.fasta) :

awk -F " " '/^>/ {print $1}' salmo.fasta \
| awk '{sub(/\.[0-9]*/, "", $1)}1' \
| awk -F ">" '{print $2}' \
| sort -u \
> all_acc_salmo.list;

Grep vs genome tables to see what we retrieved :

for f in genome_db_tables;
do grep -i -o -f all_acc_salmo.list $f \
   | sort -u \
   | wc -l;
   grep -i -o -f all_acc_salmo.list $f \
   >> catched.list;
done;

ncbi_genome_db_Sbongori_complete.txt
4 (4 in file)
ncbi_genome_db_Senterica_chromosome.txt
6 (163 in file)
ncbi_genome_db_Senterica_complete.txt
402 (480 in file)

Now 2 things are of interest :

A) 402+6+4 = 412. But salmo.fasta contains 421 sequences. What are those sequences that I downloaded that are not listed in Salmonella genome DB tables ?

comm all_acc_salmo.list catched.list > comm.tsv;
printf "unique to salmo :\t"; awk -F "\t" '{if($1 != "") print $1}' comm.tsv | wc -l;
printf "unique to catched :\t"; awk -F "\t" '{if($2 != "") print $2}' comm.tsv | wc -l;
printf "common to both :\t"; awk -F "\t" '{if($3 != "") print $3}' comm.tsv | wc -l;

"unique to salmo 9"
List below :
CP017970
CP017971
CP017972
CP017973
CP017974
CP017975
CP017976
CP017977
CP017978

I still don't get why those sequences were downloaded in salmo.fasta but are not listed in genome DB.

B) Which sequences did I miss by not using this ressource ? I thus parsed theses tables to get associated accessions only, compared them to what I already downloaded and reduce the list accordingly to download what's missing. Below an example for the ncbi_genome_db_Senterica_chromosome.tx table :

awk -F "\t" 'NR>1 {print $11}' ncbi_genome_db_Senterica_chromosome.txt \
| awk -F ";" '{print $1}' \
| awk -F ":" '{print $2}' \
| awk -F "/" '/\// {print $2} !/\// {print $0}' \
| awk '{sub(/\.[0-9]*/, "", $1)}1' \
> all_acc_Senterica_chromosome.list;

grep -i -o -f all_acc_salmo.list ncbi_genome_db_Senterica_chromosome.txt \
| sort -u \
> catched_acc_Senterica_chromosome.list;

grep -v -f catched_acc_Senterica_chromosome.list all_acc_Senterica_chromosome.list \
> toDL_acc_Senterica_chromosome.list;

And same goes for ncbi_genome_db_Senterica_complete.txt.

Now I can try to download what I missed :

epost -db nucleotide \
      -input acc_toDL_sorted.list \
      -format acc \
      | efetch -format fasta \
      > missed.fasta;

It contains 236 sequences. I added them to salmo.fasta.

ADD COMMENT
1
Entering edit mode

Annotations/collection of meta data is by no means a perfect process and if you repeat this every week/month something will always change.

If you want to be that picky I would suggest only getting the genomes from RefSeq genomes. Those are human curated and will likely be the best candidates.

ADD REPLY
0
Entering edit mode

Yes, I get what you mean.

I just wanted to achieve my goal (retrieve all full length genome sequences of genus Salmonella in NCBI DBs) and understand how to do it best.

ADD REPLY

Login before adding your answer.

Traffic: 2637 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6