Question

Find RefSeq entries with protein sequences by whether the organisms are considered part of a (specific) microbiome

4 months ago

Schmoho ▴ 10

I need to retrieve a dataset and frankly NCBI is just horribly complex. I hope someone can give me some hints on how to accomplish what I want to do. I will try to describe what I need, to the best of my understanding of what should be possible with NCBI Entrez:

RefSeq entries of whole genomes
that have translated protein sequences
of organisms which are listed in PubMed-publications as parts of (specific, e.g. gastro-intestinal) microbiomes.

There is a MeSH term (is "term" the correct terminology here?) for Microbiota, so I figure it should be possible to use it to restrict a PubMed search.

I think what I want is to further restrict the PubMed search to entries that are linked in Entrez to Protein DB entries which are also in RefSeq (if I understand correctly, RefSeq is really just a subset of other databases?).
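For reference, the MeSH-restricted PubMed search described above can be written as an NCBI E-utilities ESearch request. The following is only a minimal sketch that builds the request URL: the helper names `urlencode` and `esearch_url` are made up for illustration, the MeSH heading "Gastrointestinal Microbiome" is just one example of a specific microbiome, and the sketch does not attempt the PubMed-to-Protein linking step.

```shell
# Percent-encode a string for use in a URL query (ASCII input assumed).
# The function body runs in a subshell so its variables stay local.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}            # everything after the first character
    c=${s%"$rest"}         # the first character itself
    s=$rest
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;                     # unreserved: keep as-is
      *) out="$out$(printf '%%%02X' "'$c")" ;;             # everything else: %XX
    esac
  done
  printf '%s\n' "$out"
)

# Build an ESearch URL. $1: Entrez database, $2: query term, $3: retmax (default 20).
esearch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=%s&term=%s&retmax=%s\n' \
    "$1" "$(urlencode "$2")" "${3:-20}"
}

# Example (prints the URL only; fetch it with e.g. curl -s "$(esearch_url ...)"):
# esearch_url pubmed '"Gastrointestinal Microbiome"[MeSH Terms]'
```

The result is a list of PubMed IDs, not organisms, so mapping those publications to taxa would still need a further step.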

refseq microbiome • 722 views

ADD COMMENT • link updated 3 months ago by Mensur Dlakic ★ 28k • written 4 months ago by Schmoho ▴ 10

How about doing this the other way around? Identify the species you are interested in, then find their RefSeq genome accessions. They should all have translated proteins.

Here is an example organism: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=28901

Here is a RefSeq assembly for it: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006945.2/

You can get the proteins (the .faa file) from the FTP site, or use the Download button to get them via NCBI Datasets: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/
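The FTP path above follows a regular pattern (the nine digits of the accession split into three groups of three), so the protein file URL can be derived from the assembly accession and assembly name. A minimal sketch, assuming that pattern holds for the assembly in question; `refseq_protein_url` is a made-up helper name, and assembly names containing spaces or other special characters may be written differently in the directory name, so check the listing if a URL does not resolve.

```shell
# Build the FTP URL of the protein FASTA for one assembly.
# $1: assembly accession (e.g. GCF_000006945.2)
# $2: assembly name      (e.g. ASM694v2)
# The function body runs in a subshell so its variables stay local.
refseq_protein_url() (
  acc=$1
  name=$2
  prefix=${acc%%_*}        # "GCF"
  digits=${acc#*_}         # "000006945.2"
  digits=${digits%%.*}     # "000006945" (version suffix dropped)
  d1=$(printf '%s' "$digits" | cut -c1-3)
  d2=$(printf '%s' "$digits" | cut -c4-6)
  d3=$(printf '%s' "$digits" | cut -c7-9)
  base="${acc}_${name}"
  printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s/%s_protein.faa.gz\n' \
    "$prefix" "$d1" "$d2" "$d3" "$base" "$base"
)

# Example: print the URL for the assembly linked above, then fetch it with curl or wget.
# refseq_protein_url GCF_000006945.2 ASM694v2
```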

ADD REPLY • link 4 months ago by GenoMax 148k

score 0 · Answer 1 · 2024-09-06

Downloading many genomic datasets is trivial if you know their taxonomic groups or taxonomic IDs, for example with genome_updater:

https://github.com/pirovc/genome_updater

For example, this command would download RefSeq files (using 12 connections) for all bacteria with complete genomes, fetching their genomic sequences specifically. All files would be saved in genomes/bacteria under your current directory.

genome_updater.sh -d "refseq" -g "bacteria" -l "complete genome" -f "genomic.fna.gz" -o genomes/bacteria -t 12 -u -m -a -k

This command would get all translated protein sequences for taxonomic IDs specified in quotation marks after -T:

genome_updater.sh -d "refseq" -T "57723,200783,2498710,1930617,74152,49546,1090,142187" -f "protein.faa.gz" -o genomes/bacteria -t 10 -u -m -a -k

Taxonomic categories don't have to be individual species: you can download whole phyla or families. The only thing you need is the taxonomic ID of the group, which can be found by searching NCBI Taxonomy.
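If the list of taxonomic IDs gets long, it may be easier to keep them in a text file and build the -T argument from it. A minimal sketch: `join_taxids` is a made-up helper and `taxids.txt` is a hypothetical file with one ID per line (blank lines and `#` comments allowed). Note that it sorts and de-duplicates the IDs, so the original order is not kept.

```shell
# Read taxonomic IDs from stdin (one per line) and print them
# as a single comma-separated string, sorted and de-duplicated.
join_taxids() {
  sed -e 's/#.*//' -e 's/[[:space:]]//g' |   # strip comments and whitespace
    grep -E '^[0-9]+$' |                     # keep purely numeric lines only
    sort -un |                               # numeric sort, drop duplicates
    paste -sd, -                             # join lines with commas
}

# Example, mirroring the protein download command above
# (taxids.txt is a hypothetical file):
# genome_updater.sh -d "refseq" -T "$(join_taxids < taxids.txt)" -f "protein.faa.gz" -o genomes/bacteria -t 10 -u -m -a -k
```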

I don't know how to couple this with PubMed information, and that part may have to be done manually.