OK I think I got it now.
between here:
esearch -db assembly -query 'GCA_000005845' | \
elink -target nuccore -name assembly_nuccore_insdc | \
elink -target protein | \
and here:
esearch -db protein -query '"16S rRNA"[Protein name]' | \
efetch -format fasta_cds_na
There is a break: the results from the steps before esearch get discarded, and esearch
starts a new search from scratch. You can try this out yourself by manually entering that search term in GenBank:
https://www.ncbi.nlm.nih.gov/protein/?term=%2216S+rRNA%22%5BProtein+name%5D
You should get the same spider protein as above; it just happens to be the first result for your search term: https://www.ncbi.nlm.nih.gov/protein/2071305251
I learned this by adding efetch -format docsum
to every step, beginning with the first one.
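A related trick, in case it helps with this kind of debugging: as far as I know, each EDirect step hands the next one a small ENTREZ_DIRECT XML message that includes <Db> and <Count> fields, so you can watch those change between steps without fetching any records. This is only a rough sketch and not what I ran above; peek is a made-up helper name, and the field names are my assumption about the message layout.

```shell
# peek LABEL
# Hypothetical helper (not part of EDirect): reads the message that one
# EDirect step writes to stdout, reports its <Db> and <Count> values on
# stderr, and passes the message through unchanged to the next step.
# Assumes the message contains <Db>...</Db> and <Count>...</Count> lines.
peek() (
  msg=$(cat)
  db=$(printf '%s\n' "$msg" | sed -n 's:.*<Db>\(.*\)</Db>.*:\1:p' | head -n 1)
  count=$(printf '%s\n' "$msg" | sed -n 's:.*<Count>\(.*\)</Count>.*:\1:p' | head -n 1)
  printf '[%s] db=%s count=%s\n' "$1" "$db" "$count" >&2
  printf '%s\n' "$msg"
)

# Example use (needs EDirect installed and network access):
#   esearch -db assembly -query 'GCA_000005845' | peek assembly |
#     elink -target nuccore -name assembly_nuccore_insdc | peek nuccore |
#     elink -target protein | peek protein |
#     efetch -format docsum
```

If a step in the middle reports a count that has nothing to do with the step before it, that is where the pipeline stopped carrying your earlier results forward.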
Here's what you could do instead, replacing esearch with efilter, though I'm sure there's a nicer way of doing this:
esearch -db assembly -query 'GCA_000005845' | \
elink -target nuccore -name assembly_nuccore_insdc | \
elink -target protein | \
efilter -query '16S[Title]' | \
efetch -format fasta > all_my_16S_proteins.fasta
It's not perfect yet - I do get several proteins, but all of them at least have 16S in their title, and all of them come from that group of assemblies (E. coli MG1655).
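To double-check the "all of them have 16S in their title" part on the downloaded file, something like the following works offline. It is just a sketch: fasta_title_report is a made-up function name, and it only looks at the FASTA header lines (those starting with ">"), not at the sequences.

```shell
# fasta_title_report FILE PATTERN
# Hypothetical helper: prints the number of FASTA records in FILE and how
# many of their header lines do NOT contain PATTERN.
fasta_title_report() (
  # grep -c exits non-zero when the count is 0, hence the "|| true"
  total=$(grep -c '^>' "$1" || true)
  missing=$(grep '^>' "$1" | grep -vc -- "$2" || true)
  printf 'records=%s without_pattern=%s\n' "$total" "$missing"
)

# Example use on the file written by the efilter pipeline:
#   fasta_title_report all_my_16S_proteins.fasta 16S
```

A without_pattern value of 0 means every header mentions the pattern; it says nothing about whether those proteins are what you actually wanted.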
Probably not exactly what you are looking for but try:
Do you have one or two of the genomes in the list? I just want to try out the command; I've been bitten by Entrez being fiddly with quotation marks, and it always takes me some debugging runs to find them.
While the following seems to work (E. coli genome), the command does not work reliably (at least for me: it spits out a bunch of errors on multiple tries; the following is a successful run without errors).
Hi GenoMax
This is what I am seeing as well. You provided an E. coli genome, but the 16S sequence is for a different entry, Trichonephila clavata. https://www.ncbi.nlm.nih.gov/protein/GFQ73155.1/
And when I provide a list, I get this same entry for them all. So it seems to not be the loop, but something with piping the different commands.
You can feed in starts and stops below to recover the actual sequence
(Ignore the column with all 9's below).
Another option