Hi there,
I'm having a bit of trouble working out how to construct a particular pipeline using the Entrez E-utilities. Specifically, I want to run a search in the Taxonomy database, e.g.
Insecta[ORGN] AND genus[RANK]
Then, for each ID returned by that search, find all the linked IDs in the Nucleotide database, and filter those Nucleotide IDs with a Nucleotide query, which is this:
cox1[gene]
Then fetch the FASTA sequences. I would also like to preserve the mapping between the IDs, so ideally I would get something like this:
tax_id_1 --> nucleotide_id_1 --> fasta sequence
tax_id_2 --> nucleotide_id_2 --> fasta sequence
...
tax_id_n --> nucleotide_id_n --> fasta sequence
(Where tax_id_1 and tax_id_2 are IDs from the Taxonomy query, nucleotide_id_1 is the COX1 gene sequence for tax_id_1, nucleotide_id_2 is the COX1 gene sequence for tax_id_2, etc.)
I started out doing this in Python, then decided to build the queries in the browser instead (just to keep things simple). I have used ESearch to handle the Taxonomy query and ELink to map from the Taxonomy IDs to the Nucleotide IDs (preserving the one-to-one correspondence), but I'm stuck on how to then filter those Nucleotide IDs so that I only get COX1 as the gene. I have tried doing this previously with varying degrees of success, and even where I did manage to pull it off, I probably did it in a very inelegant way!
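To make the goal concrete, here is a rough Python sketch of one way the mapping could be kept. It sidesteps the ELink filtering step: instead of linking and then filtering, it runs one Nucleotide ESearch per Taxonomy ID, using a `txid<ID>[Organism:exp]` term combined with `cox1[gene]`. The function names (`esearch_url`, `parse_ids`, `cox1_term`, `build_mapping`) are made up for illustration, and the per-taxon query is an assumption about what would work here, not a tested solution.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=100):
    """Build an ESearch URL for one database and one Entrez query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def parse_ids(xml_text):
    """Pull the UIDs out of an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


def cox1_term(tax_id):
    """Nucleotide query restricted to one taxon (assumed query form)."""
    return f"txid{tax_id}[Organism:exp] AND cox1[gene]"


def fetch_text(url):
    """Download a URL and return the body as text."""
    with urlopen(url) as resp:
        return resp.read().decode()


def build_mapping(tax_term="Insecta[ORGN] AND genus[RANK]",
                  fetch=fetch_text, delay=0.34):
    """Return {tax_id: [nucleotide_id, ...]}, one ESearch per taxon.

    `delay` spaces out requests to stay under NCBI's limit of three
    requests per second for clients without an API key.
    """
    mapping = {}
    for tax_id in parse_ids(fetch(esearch_url("taxonomy", tax_term))):
        time.sleep(delay)
        xml_text = fetch(esearch_url("nuccore", cox1_term(tax_id)))
        mapping[tax_id] = parse_ids(xml_text)
    return mapping
```

Calling `build_mapping()` would give the tax_id --> nucleotide_id part of the table above; the FASTA records could then be pulled with EFetch (`db=nuccore`, `rettype=fasta`) for each list of Nucleotide IDs. With only the default `retmax=100`, large result sets would be truncated, so paging would be needed for a real run.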
How would you go about doing something like this?
Cheers
Thanks, both of you. I did it the other way around and I think I've got what I wanted now (with 1000x less hassle) by starting from the Nucleotide database and working toward Taxonomy (as per your suggestions). I'm still curious about whether it's possible to use the E-utilities to filter a list of IDs based on a query (see my reply below scapella's response), but at this stage it's not a big deal at all.
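On the open question of filtering an existing ID list by a query, one approach that might work is the History server: upload the IDs with EPost, then run an ESearch in the same database whose term combines the returned query key (written as `#<key>`) with the filter. The sketch below only builds the URLs and parses the EPost response. The helper names (`epost_url`, `parse_history`, `filter_url`) are hypothetical, and I haven't verified this against a large ID list.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def epost_url(db, ids):
    """URL that uploads a list of UIDs to the History server.

    For long ID lists an HTTP POST would be the safer choice; a GET
    URL is used here only to keep the sketch short.
    """
    return f"{EUTILS}/epost.fcgi?" + urlencode({"db": db, "id": ",".join(ids)})


def parse_history(xml_text):
    """Return (query_key, webenv) from an EPost XML response."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


def filter_url(db, query_key, webenv, term):
    """ESearch URL that intersects the posted set with an Entrez query."""
    params = {
        "db": db,
        "term": f"#{query_key} AND {term}",  # '#' is sent URL-encoded as %23
        "WebEnv": webenv,
        "usehistory": "y",
    }
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)
```

For example, after posting Nucleotide IDs and reading back the query key and WebEnv, `filter_url("nuccore", query_key, webenv, "cox1[gene]")` should return only the posted IDs that match `cox1[gene]`. Note that this filters the set as a whole, so the link back to each Taxonomy ID would have to be tracked separately (e.g. one EPost per taxon).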