Find RefSeq entries with protein sequences for organisms considered part of a (specific) microbiome
11 weeks ago
Schmoho ▴ 10

I need to retrieve a dataset and frankly NCBI is just horribly complex. I hope someone can give me some hints on how to accomplish what I want to do. I will try to describe what I need, to the best of my understanding of what should be possible with NCBI Entrez:

  1. RefSeq entries of whole genomes
  2. that have translated protein sequences
  3. of organisms which are listed in PubMed-publications as parts of (specific, e.g. gastro-intestinal) microbiomes.

There is a MeSH-term (is "term" the correct terminology here?) for Microbiota, so I figure it should be possible to use this to restrict a PubMed-search.
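A PubMed search restricted by a MeSH heading can be sketched against the E-utilities `esearch` endpoint. This is a minimal sketch, not a definitive recipe: the helper names `encode_term` and `esearch_url` are made up for illustration, the encoder only handles the few characters that show up in a fielded query, and "Gastrointestinal Microbiome" is used as an example heading that should be checked in the MeSH browser first.

```shell
# Percent-encode the handful of characters found in a typical fielded
# Entrez query. This is NOT a general-purpose URL encoder.
encode_term() {
    printf '%s' "$1" | sed \
        -e 's/%/%25/g' \
        -e 's/+/%2B/g' \
        -e 's/&/%26/g' \
        -e 's/"/%22/g' \
        -e 's/\[/%5B/g' \
        -e 's/]/%5D/g' \
        -e 's/ /+/g'
}

# Build an esearch URL.
#   $1 = Entrez database (e.g. pubmed)
#   $2 = query string, including field tags such as [MeSH Terms]
#   $3 = maximum number of IDs to return (defaults to 20)
esearch_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=%s&term=%s&retmax=%s\n' \
        "$1" "$(encode_term "$2")" "${3:-20}"
}

# Usage (the curl call does a network request, so it is left commented out):
#   curl -s "$(esearch_url pubmed '"Gastrointestinal Microbiome"[MeSH Terms]' 100)"
```

The response is XML containing a list of PubMed IDs. Those IDs say which papers are indexed under the heading, not which organisms the papers discuss, so the organism list would still have to be extracted separately.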

I think what I want is to further restrict the PubMed search to records that are linked in Entrez to entries in the Protein database, keeping only those entries that are also in RefSeq (as I understand it, RefSeq is actually just a subset of other databases?).
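The cross-database step described here corresponds to the E-utilities `elink` endpoint, which maps IDs in one Entrez database to linked records in another. A rough sketch follows, assuming the PubMed IDs are already known; `elink_url` is a hypothetical helper name and `12345678` is a placeholder, not a real paper of interest.

```shell
# Build an elink URL that maps IDs from one Entrez database to another.
#   $1 = source database (e.g. pubmed)
#   $2 = target database (e.g. protein)
#   $3 = one or more IDs, separated by spaces or commas
elink_url() (
    # Normalise space-separated IDs into the comma-separated form.
    ids=$(printf '%s' "$3" | tr -s ' ' ',')
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=%s&db=%s&id=%s\n' \
        "$1" "$2" "$ids"
)

# Usage (network call, left commented out; the PMID is a placeholder):
#   curl -s "$(elink_url pubmed protein 12345678)"
```

One caveat: as far as I understand, PubMed-to-Protein links exist mainly where a sequence record cites the paper, so a microbiome survey that merely mentions a species will usually not link to that species' proteins. The linked protein IDs would also still need to be narrowed to RefSeq afterwards, for instance with a RefSeq filter in a follow-up Protein search.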

refseq microbiome • 684 views

How about doing this the other way around? Identify the species you are interested in, then find their RefSeq genome accessions. They should all have translated proteins.

Here is an example organism: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=28901

Here is the RefSeq assembly for it: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000006945.2/

You can get the proteins from the FTP site (or use the Download button to get via datasets): https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/ (the .faa file)
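The FTP path above follows a regular pattern: the nine digits of the accession are split into three groups of three, and the directory name is the accession plus the assembly name. A small sketch of building the protein file URL from those two pieces is below. The function names are made up for illustration, and it assumes the assembly name appears in the path exactly as given, which holds for this example but may not for assembly names containing spaces or other special characters.

```shell
# Build the FTP directory URL for an assembly.
#   $1 = assembly accession (e.g. GCF_000006945.2)
#   $2 = assembly name as it appears in the path (e.g. ASM694v2)
refseq_ftp_dir() (
    acc=$1
    asm=$2
    prefix=${acc%%_*}        # GCF
    digits=${acc#*_}         # 000006945.2
    digits=${digits%%.*}     # 000006945
    p1=$(printf '%s\n' "$digits" | cut -c1-3)
    p2=$(printf '%s\n' "$digits" | cut -c4-6)
    p3=$(printf '%s\n' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$asm"
)

# Build the URL of the translated protein FASTA (.faa) inside that directory.
refseq_protein_url() (
    dir=$(refseq_ftp_dir "$1" "$2")
    printf '%s/%s_%s_protein.faa.gz\n' "$dir" "$1" "$2"
)

# Usage (network call, left commented out):
#   curl -O "$(refseq_protein_url GCF_000006945.2 ASM694v2)"
```

For more than a handful of assemblies, a dedicated downloader is probably less error-prone than hand-built URLs.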

11 weeks ago
Mensur Dlakic ★ 28k

Downloading many genomic datasets is trivial if you know their taxonomic groups, or taxonomic IDs.

https://github.com/pirovc/genome_updater

For example, this command would download RefSeq files (using 12 connections) for all bacteria that have complete genomes sequenced, and specifically it would get their genomic sequences. All files would be saved in genomes/bacteria within your local directory.

genome_updater.sh -d "refseq" -g "bacteria" -l "complete genome" -f "genomic.fna.gz" -o genomes/bacteria -t 12 -u -m -a -k

This command would get all translated protein sequences for taxonomic IDs specified in quotation marks after -T:

genome_updater.sh -d "refseq" -T "57723,200783,2498710,1930617,74152,49546,1090,142187" -f "protein.faa.gz" -o genomes/bacteria -t 10 -u -m -a -k
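If the list of taxonomic IDs is curated by hand (for example from microbiome papers), it may be easier to keep it in a text file and build the comma-separated `-T` argument from that. A minimal sketch, assuming a hypothetical file `taxids.txt` with one ID per line and optional `#` comments; `join_taxids` is an illustrative helper, not part of genome_updater.

```shell
# Turn a file of taxonomic IDs (one per line, '#' starts a comment,
# blank lines allowed) into a single comma-separated string.
#   $1 = path to the ID file
join_taxids() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }' | paste -sd, -
}

# Usage with the same options as the command above (downloads data,
# so it is left commented out):
#   genome_updater.sh -d "refseq" -T "$(join_taxids taxids.txt)" -f "protein.faa.gz" -o genomes/bacteria -t 10 -u -m -a -k
```

Keeping the IDs in a commented file also leaves room to note which publication each organism came from.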

Taxonomic categories don't have to be individual species: you can download whole phyla or families. All you need is the taxonomic ID of the group, which can be found by searching NCBI's Taxonomy database.

I don't know how to couple this with PubMed information, and that part may have to be done manually.
