Downloading assemblies with description from NCBI with edirect
4.8 years ago
agata ▴ 10

I'm trying to get the species and taxonomic family from the NCBI database for each GCA_..... accession listed in my list.txt file, but my output file is empty. Ideally, each row of the output file would look like: GCA_.... Species Family

So far I have tried this:

while IFS= read -r line
do
  esearch -db assembly -query "$line" | xtract -pattern DocumentSummary \
      -element ScientificName Division >> output.txt
done < ./list.txt

I also tried this, but I don't know how to add the GCA accession to the output:

while IFS= read -r line
do
esearch -db assembly -query $line | elink -target taxonomy | esummary | xtract -pattern DocumentSummary -element ScientificName Division AssemblyAccesion >> output.txt
done < ./list.txt
NCBI assembly entrez direct • 1.5k views

Please post some example GCA accessions.

GCA_902705575 GCA_000002455 GCA_000002595 GCA_000002975 GCA_000018645 GCA_000090985 GCA_000091205 GCA_000092065 GCA_000143455

4.8 years ago
GenoMax 147k

There is most likely no way to do this in a single query. You may need to get the taxID first:

$ esearch -db assembly -query "GCA_902705575" | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Organism,Taxid 
GCA_902705575.1 Ectocarpus sp. CCAP 1310/34 (brown algae)       867726

and then look up the family separately.

$ esearch -db assembly -query "GCA_902705575" | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -if Rank -equals "family" -element ScientificName
Ectocarpaceae

For some accessions a family does not seem to be defined, so you may have to settle for the ranks that are available:

$ esearch -db assembly -query "GCA_000002455" | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -unless Rank -equals "no rank" -tab "\n" -element Rank,ScientificName
superkingdom    Eukaryota
phylum  Cercozoa
class   Chlorarachniophyceae
genus   Bigelowiella

Thanks! It really helped me.

However, when I run it in a loop, the output.txt file contains only one element (the last one). Do you have any idea how to solve this?

Use the epost method.

$ cat list.txt | epost -db assembly -format acc | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Organism,Taxid >> name_acc.txt


$ cat list.txt | epost -db assembly -format acc | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -if Rank -equals "family" -element ScientificName >> out.txt
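
The thread never pins down why only one row survived the loop. Two common shell pitfalls can produce that symptom: writing with `>` inside the loop, which overwrites the file on every pass, and running a command inside the loop that reads from the same standard input as `read`, which swallows the rest of the list. A minimal sketch that guards against both, assuming a hypothetical `lookup` function as a stand-in for the EDirect pipeline (it is not part of EDirect, and the file names are illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for an EDirect pipeline such as
#   esearch -db assembly -query "$1" | esummary | xtract ...
# It deliberately reads whatever is on its standard input.
lookup() {
  cat > /dev/null          # swallows stdin, as a command in the loop might
  printf '%s\tresult\n' "$1"
}

workdir=$(mktemp -d)
printf 'GCA_902705575\nGCA_000002455\nGCA_000002595\n' > "$workdir/list.txt"

# Each lookup runs with stdin detached (< /dev/null), so it cannot consume
# the remaining lines of list.txt. The output file is opened once, after
# `done`, so no iteration can overwrite an earlier one.
while IFS= read -r acc
do
  lookup "$acc" < /dev/null
done < "$workdir/list.txt" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

If the loop still yields a single row with the real pipeline, the cause lies elsewhere and the epost approach above avoids the loop altogether.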

Do you know how it is possible that the file "name_acc.txt" has more rows than my input list of GCA accession numbers (list.txt)? My input list is 207 rows long, but the output list is 223 rows long and contains additional GCF_.... accessions (e.g. GCF_000350225.1), 30 of them. This complicates joining the two outputs into one table at the end, as the rows probably do not correspond to each other. Is there some common element that could be used later on as a key?

Use the accession number as the key.

That's a really great idea! But it's not printed in both output files, even when I add AssemblyAccession to the second command:

$ cat list.txt | epost -db assembly -format acc | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -if Rank -equals "family" -element ScientificName, AssemblyAccession >> out.txt

That is because we are using elink, which loses the original search term. One workaround could be the following:

$ for i in `cat ./acc.txt`; do echo ${i}; esearch -db assembly -query ${i} | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -if Rank -equals "family" -element ScientificName; done
GCA_902705575
Ectocarpaceae
GCA_000002455
GCA_000002595
Chlamydomonadaceae
GCA_000002975
Geminigeraceae
GCA_000018645
Hemiselmidaceae
GCA_000090985
Mamiellaceae
GCA_000091205
Cyanidiaceae
GCA_000092065
Bathycoccaceae
GCA_000143455
Volvocaceae

Note that not all accession numbers have family entries.
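
Because the loop prints each accession on its own line, followed by a family line only when one exists, the output can be folded into a two-column table afterwards. A minimal sketch with awk, assuming every accession line starts with `GCA_` or `GCF_` and that at most one family line follows it; `NA` is an arbitrary placeholder for a missing family, and the sample input is copied from the output above:

```shell
#!/usr/bin/env bash
# First five lines of the loop output shown in the thread.
loop_output='GCA_902705575
Ectocarpaceae
GCA_000002455
GCA_000002595
Chlamydomonadaceae'

family_table=$(printf '%s\n' "$loop_output" | awk '
  # An accession line closes the previous record and opens a new one.
  /^GC[AF]_/ {
    if (acc != "") print acc "\t" (fam == "" ? "NA" : fam)
    acc = $0; fam = ""
    next
  }
  # Any other line is taken to be the family of the current accession.
  { fam = $0 }
  # Flush the last record.
  END { if (acc != "") print acc "\t" (fam == "" ? "NA" : fam) }
')

printf '%s\n' "$family_table"
```

Each input accession then appears exactly once in the first column, which makes it usable as a key.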

Would it be possible to get the GCA number from the input file printed in the output file for this solution?

$ cat GCA_list.txt | epost -db assembly -format acc | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Organism,Taxid >> name_acc_org.txt

As it was done here?

$ for i in `cat ./acc.txt`; do echo ${i}; esearch -db assembly -query ${i} | elink -target taxonomy | efetch -format native -mode xml | xtract -pattern Taxon -block "*/Taxon" -if Rank -equals "family" -element ScientificName; done

I've tried to join those two commands, but none of my ideas worked :(

$ for i in `cat ./acc.txt`; do echo ${i}; esearch -db assembly -query ${i} | esummary | xtract -pattern DocumentSummary -element AssemblyAccession,Organism,Taxid ; done
GCA_902705575
GCA_902705575.1 Ectocarpus sp. CCAP 1310/34 (brown algae)       867726
GCA_000002455
GCF_000002455.1 Bigelowiella natans (cercozoans)        227086
GCA_000002595
GCA_000002595.3 Chlamydomonas reinhardtii (green algae) 3055
GCF_000002595.1 Chlamydomonas reinhardtii (green algae) 3055
GCA_000002595.1 Chlamydomonas reinhardtii (green algae) 3055
GCA_000002975
GCF_000002975.1 Guillardia theta (cryptomonads) 55529
GCA_000018645
GCF_000018645.1 Hemiselmis andersenii (cryptomonads)    464988
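
The output above already pairs each queried accession with the summary rows it returned, so the query can be copied onto every row to serve as the join key. A minimal sketch, assuming xtract's default tab-separated output and that a line without a tab is the echoed query. The small family table is typed in by hand from the family lookups earlier in the thread (with `NA` as an arbitrary placeholder), and the file name `family.tsv` is illustrative:

```shell
#!/usr/bin/env bash
tab=$(printf '\t')

# Sample of the loop output: the echoed query, then its esummary rows.
summary_output="GCA_902705575
GCA_902705575.1${tab}Ectocarpus sp. CCAP 1310/34 (brown algae)${tab}867726
GCA_000002455
GCF_000002455.1${tab}Bigelowiella natans (cercozoans)${tab}227086
GCA_000002595
GCA_000002595.3${tab}Chlamydomonas reinhardtii (green algae)${tab}3055
GCF_000002595.1${tab}Chlamydomonas reinhardtii (green algae)${tab}3055"

# Queried accession and family, NA where no family was reported.
family_table="GCA_902705575${tab}Ectocarpaceae
GCA_000002455${tab}NA
GCA_000002595${tab}Chlamydomonadaceae"

workdir=$(mktemp -d)
printf '%s\n' "$family_table" > "$workdir/family.tsv"

merged=$(printf '%s\n' "$summary_output" | awk -F '\t' -v OFS='\t' '
  # First file: remember the family for each queried accession.
  NR == FNR { family[$1] = $2; next }
  # A line with a single field is the echoed query, so it becomes the key.
  NF == 1 { query = $1; next }
  # Summary row: prepend the key and append the family.
  { print query, $0, (query in family ? family[query] : "NA") }
' "$workdir/family.tsv" -)

printf '%s\n' "$merged"
```

Rows such as the GCF_ entries keep the GCA_ query in the first column, so they can be filtered or matched later without guessing which input accession they came from.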