I need to retrieve a dataset, and frankly NCBI is just horribly complex. I hope someone can give me some hints on how to accomplish this. Here is what I need, to the best of my understanding of what should be possible with NCBI Entrez:
- RefSeq entries of whole genomes
- that have translated protein sequences
- of organisms which are listed in PubMed-publications as parts of (specific, e.g. gastro-intestinal) microbiomes.
There is a MeSH term (is "term" the correct terminology here?) for Microbiota, so I figure it should be possible to use it to restrict a PubMed search.
I think what I want is to further restrict the PubMed search to entries that are linked in Entrez to entries in the Protein DB which also occur in RefSeq (as I understand it, RefSeq is actually just a subset of other databases?).
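For reference, the PubMed-first route described above could be sketched with the Entrez E-utilities roughly as follows. This is only a sketch under assumptions: the helper names are made up for illustration, "Gastrointestinal Microbiome" is used as an example MeSH heading, and the link name `pubmed_protein_refseq` is my recollection of the PubMed-to-Protein (RefSeq) link, so it should be checked against what `einfo` reports for PubMed before relying on it. The functions only build the request URLs; nothing is sent to NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pubmed_search_url(mesh_heading, retmax=100):
    """Build an esearch URL restricting PubMed to one MeSH heading.

    Hypothetical helper: only constructs the URL, does not fetch it.
    """
    query = f'"{mesh_heading}"[MeSH Terms]'
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def pubmed_to_protein_url(pmids, linkname="pubmed_protein_refseq"):
    """Build an elink URL from PubMed IDs to linked Protein records.

    The default linkname is an assumption; verify it with einfo.
    Each PMID is passed as its own id= parameter so that links are
    reported per article rather than merged.
    """
    params = [
        ("dbfrom", "pubmed"),
        ("db", "protein"),
        ("linkname", linkname),
        ("retmode", "json"),
    ]
    params += [("id", str(pmid)) for pmid in pmids]
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(pubmed_search_url("Gastrointestinal Microbiome"))
    # The PMIDs here are placeholders, not real search results.
    print(pubmed_to_protein_url([1, 2]))
```

The PMIDs returned by the first request would be fed into the second one. Note that this only finds proteins that are explicitly linked to a publication, which is likely a small and uneven subset of what is in RefSeq.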
How about doing this the other way around? Identify the species you are interested in, then find their RefSeq genome accessions. They should all have translated proteins.
Here is an example organism: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=28901
Here is the RefSeq assembly for it: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006945.2/
You can get the proteins from the FTP site (or use the Download button to get via datasets): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/ (the .faa file)
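If you have many assemblies, the FTP directory can be derived from the accession and the assembly name, following the pattern visible in the URL above. Here is a minimal sketch, assuming the standard layout of the genomes FTP site; the function names are made up for illustration, and the `_protein.faa.gz` file name is an assumption based on how these directories are usually populated, so check the directory listing for your assembly.

```python
FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_dir_url(accession, assembly_name):
    """Return the FTP directory URL for an assembly.

    accession     e.g. "GCF_000006945.2"
    assembly_name e.g. "ASM694v2"
    """
    prefix, rest = accession.split("_", 1)       # "GCF", "000006945.2"
    digits = rest.split(".")[0]                  # "000006945"
    # The nine digits are split into three-character path components.
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    dirname = f"{accession}_{assembly_name}"
    return f"{FTP_ROOT}/{prefix}/{'/'.join(chunks)}/{dirname}/"


def protein_faa_url(accession, assembly_name):
    """Return the assumed URL of the protein FASTA for an assembly."""
    dirname = f"{accession}_{assembly_name}"
    return assembly_dir_url(accession, assembly_name) + f"{dirname}_protein.faa.gz"


if __name__ == "__main__":
    print(assembly_dir_url("GCF_000006945.2", "ASM694v2"))
    print(protein_faa_url("GCF_000006945.2", "ASM694v2"))
```

For the example assembly, `assembly_dir_url` reproduces the directory linked above.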