Question

How to get gene symbols & nucleotide FASTA for taxid 1239

6.8 years ago

anu014 ▴ 190

Hello Biostars!

I was trying to get gene symbols for taxon 1239 (Firmicutes) from RefSeq protein IDs, but was unable to do so using bioDBnet (https://biodbnet-abcc.ncifcrf.gov/db/db2db.php). E.g. 'WP_020487904.1' (https://www.ncbi.nlm.nih.gov/protein/521976633/).

Even the gene2refseq file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz) doesn't contain taxid 1239 or 1000277 (a species under 1239).

Can anyone tell me how to get all the genes and their respective FASTA sequences for Firmicutes?

gene sequence genome • 2.4k views

ADD COMMENT • link updated 6.8 years ago by tdmurphy ▴ 230 • written 6.8 years ago by anu014 ▴ 190

Do you need Gene symbol or gene sequences in the fasta format? Do you need this data for txid1239 or txid1000277?

For example, gene symbol info is included in the gene table, which can be downloaded using the following NCBI EDirect (Unix E-utilities) command:

esearch -db gene -query "txid1239[Organism:exp]" | efetch -format tabular

ADD REPLY • link 6.8 years ago by Sej Modha 5.3k

I want gene symbols and FASTA sequences, given RefSeq protein IDs as input, for taxon ID 1239.

ADD REPLY • link 6.8 years ago by anu014 ▴ 190

You can get the sequence by doing the following:

esearch -db protein -query "txid1239[Organism:exp]" | efetch -format fasta > seq.fa

ADD REPLY • link 6.8 years ago by GenoMax 147k

I know it's a basic question, but how do I download esearch? It's throwing an error: 'No command 'esearch' found' ...

ADD REPLY • link 6.8 years ago by anu014 ▴ 190

Okay, I got it now. One can download the EDirect suite from here: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/ . It contains the esearch and efetch programs.

ADD REPLY • link 6.8 years ago by anu014 ▴ 190

It's still not working, @GenoMax. After running this, it shows me the help text of efetch:

EFETCH - retrieve entries from sequence databases.

Synopsis: efetch -options [database:]<query>

Databases: SWissprot/SP, PIR, WOrmpep/WP, EMbl, GEnbank/GB, ProDom, ProSite

Options:
-a            Search with Accession number
-f            Fasta format output
-q            Sequence only output (one line)
-s <#>        Start at position #
-e <#>        Stop at position #
-o            More options and info...

-D <dir>      Specify database directory
-H            Display index header data
-p            Display entrynames in search path
-r            Print sequence in 'raw' format
-m            Fetch from mixed mini database
-M            Mini format output
-b            Do NOT reverse the order of bytes
                          (SunOS, IRIX do reverse, Alpha not)
-d <dbfile>   Specify database file (avoid this)
-i <idxfile>  Specify index file (avoid this)
-l <divfile>  Specify division lookup table (avoid this)
-B <database> Specify database (archaic)
-A            Only return entryname for accession number
-n <name>     Give the sequence this name
-x            Don't require query to match entry's name exactly (avoid)
-w            For Wormpep: also fetch cross-referenced SwissProt entry
-h            shows this help text

Environment:
SWDIR      = SwissProt directory - database and EMBL index files
PIRDIR     = PIR -- " --
WORMDIR    = Wormpep -- " --
EMBLDIR    = EMBL -- " --
GBDIR      = Genbank -- " --
PRODOMDIR  = ProDom -- " --
PROSITEDIR = ProSite -- " --
DBDIR      = User's own -- " -- (fasta format)

SEQDB      database file (default SwissProt)
SEQDBIDX   index file
DIVTABL    division lookup table

Ex. setenv DBDIR /pubseq/seqlibs/embl/

Note that Prodom family consensus seqs can be fetched by PD:_#

by Erik Sonnhammer (esr@sanger.ac.uk) Version 2.1,

ADD REPLY • link 6.8 years ago by anu014 ▴ 190

I am not sure if you are using the correct version of the EDirect utilities. Download the latest version from: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/versions/current . You can also have a look at this blog for more info.

ADD REPLY • link 6.8 years ago by Sej Modha 5.3k
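The help text pasted above is signed "Erik Sonnhammer ... Version 2.1" and lists SwissProt/PIR/EMBL databases, which suggests that a different program that happens to be called efetch is being found on PATH, rather than the NCBI EDirect one. A minimal sketch for checking this, assuming a POSIX shell; `which_all` is a hypothetical helper name, not part of EDirect:

```shell
# Hypothetical helper: list every executable called NAME on PATH, in lookup order.
# The first line printed is the one the shell will actually run.
which_all() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -z "$dir" ] && dir=.   # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

which_all efetch
```

If the first path printed is not inside the EDirect directory, putting that directory earlier on PATH should help, e.g. `export PATH="$HOME/edirect:$PATH"` (here `$HOME/edirect` is an assumption about where EDirect was unpacked).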

score 0 · Answer 1 · 2018-02-24

Many bacterial RefSeq genomes aren't available in NCBI's Gene database, so E-utilities queries against the gene db won't work. If you have a specific set of assemblies in mind, try downloading the "feature_table.txt" files for that set and parsing what you need from there, e.g.:
https://www.ncbi.nlm.nih.gov/assembly/?term=txid1239%5Borgn%5D+latest_refseq%5Bfilter%5D
Then use the "Download Assemblies" button to download the "Feature table" file for the RefSeq assemblies. All of Firmicutes is about 35k assemblies and a 4.6 GB download.

Your example protein is in this file:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/372/005/GCF_000372005.1_ASM37200v1/GCF_000372005.1_ASM37200v1_feature_table.txt.gz
The genomic location is in columns 7-10, and the gene symbol (if available) is in column 15. You could then use E-utilities to get the FASTA sequence for that genomic range.
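Pulling those columns out of a feature table for one protein could look roughly like the sketch below. `feature_lookup` is a hypothetical helper name. Columns 7-10 and 15 are as described above; the assumptions that the file is tab-separated, that column 1 holds the feature type (`CDS`), and that column 11 holds the protein accession should be checked against the header line of the downloaded file.

```shell
# Hypothetical helper: look up one protein accession in a feature_table.txt(.gz) file.
# Prints: genomic accession, start, end, strand, gene symbol (tab-separated).
# Assumed layout: col 1 = feature type, cols 7-10 = location, col 11 = protein
# accession, col 15 = gene symbol.
feature_lookup() {
    acc=$1
    file=$2
    case $file in
        *.gz) gzip -dc "$file" ;;
        *)    cat "$file" ;;
    esac | awk -F '\t' -v acc="$acc" '
        BEGIN { OFS = "\t" }
        $1 == "CDS" && $11 == acc {
            symbol = ($15 == "") ? "NA" : $15   # many rows have no symbol
            print $7, $8, $9, $10, symbol
        }'
}

# Example call (file name taken from the URL above):
# feature_lookup WP_020487904.1 GCF_000372005.1_ASM37200v1_feature_table.txt.gz
```

More than one line may come back, since the same WP accession can be annotated at several locations.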

If you want CDS nucleotide sequence (same as the gene sequence), with gene symbols in the FASTA headers, try the "CDS from genomic" file from that same download option (31.8 GB). Your example has a header like this:

>lcl|NZ_AQYY01000001.1_cds_WP_020487904.1_359 [gene=clpB] [locus_tag=A37G_RS0101875] [protein=ATP-dependent chaperone ClpB] [protein_id=WP_020487904.1] [location=424554..427151] [gbkey=CDS]
ATGGACACCGACAAGCTGACGACCCGCAGCCGGGACGCGGTCTCGGCCGCCCTGCGCACCGCTCTGACGAAAGGCAACCC
GGCGGCCGAGCCGGTGCACCTGCTGTACGCGTTGCTGCTGGTCCCCGACAACACGGTCGCGCCCCTGCTGGGCTCGATCG
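Since the symbol sits in the `[gene=...]` tag of each header, a protein-to-symbol table can be pulled from such a file without touching the sequences. A minimal sketch, assuming headers shaped like the one above; `cds_header_symbols` is a hypothetical helper name:

```shell
# Hypothetical helper: read a "CDS from genomic" FASTA on stdin and print
# "protein_id<TAB>gene" for each record; "NA" where a tag is missing.
cds_header_symbols() {
    awk '
        /^>/ {
            gene = "NA"; prot = "NA"
            # "[gene=" is 6 characters; the closing "]" makes 7 to trim in total
            if (match($0, /\[gene=[^]]*\]/))
                gene = substr($0, RSTART + 6, RLENGTH - 7)
            # "[protein_id=" is 12 characters; plus the closing "]" makes 13
            if (match($0, /\[protein_id=[^]]*\]/))
                prot = substr($0, RSTART + 12, RLENGTH - 13)
            print prot "\t" gene
        }'
}

# Example call (the file name is a placeholder for whatever was downloaded):
# gzip -dc cds_from_genomic.fna.gz | cds_header_symbols > protein_to_symbol.tsv
```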

To do that for individual proteins via e-utils, you could use something like:

# first use the IPG report to get the nucleotide accession and location
esearch -db protein -query WP_020487904 | esummary -format ipg | grep WP_020487904
41115784    RefSeq  NZ_AQYY01000001.1   424554  427151  +   WP_020487904.1  ATP-dependent chaperone ClpB    Dehalobacter sp. FTH1   FTH1    GCF_000372005.1

# then use that location from columns 3-6 to get the sequence:
efetch -db nuccore -id NZ_AQYY01000001.1 -seq_start 424554 -seq_stop 427151 -strand plus -format fasta_cds_na
>lcl|NZ_AQYY01000001.1_cds_WP_020487904.1_1 [gene=clpB] [locus_tag=A37G_RS0101875] [protein=ATP-dependent chaperone ClpB] [protein_id=WP_020487904.1] [location=424554..427151] [gbkey=CDS]
ATGGACACCGACAAGCTGACGACCCGCAGCCGGGACGCGGTCTCGGCCGCCCTGCGCACCGCTCTGACGA

Keep in mind a single WP may be found on multiple assemblies (or even at multiple locations of the same assembly), so the IPG report may have multiple rows for the same WP accession.
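To handle several IPG rows for one WP accession, each matching row can be turned into its own efetch call. A minimal sketch, assuming the IPG output is tab-separated with the column order of the row shown above (3 = nucleotide accession, 4 = start, 5 = stop, 6 = strand, 7 = protein accession); `ipg_to_efetch` is a hypothetical helper name, and mapping `-` to `-strand minus` is an assumption that should be checked against a known reverse-strand gene:

```shell
# Hypothetical helper: read IPG report rows on stdin and print one efetch
# command per row whose protein accession (column 7) matches the argument,
# with or without the version suffix. Commands are printed, not executed.
ipg_to_efetch() {
    awk -F '\t' -v acc="$1" '
        $3 != "" && ($7 == acc || index($7, acc ".") == 1) {
            strand = ($6 == "-") ? "minus" : "plus"
            printf "efetch -db nuccore -id %s -seq_start %s -seq_stop %s -strand %s -format fasta_cds_na\n", $3, $4, $5, strand
        }'
}

# Example call: pipe the IPG report from the command above into the helper,
# review the printed commands, then run them by appending "| sh".
# <IPG command from above, without the grep> | ipg_to_efetch WP_020487904
```

Printing the commands first makes it easy to see how many locations a given WP maps to before downloading anything.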

Note only about 10% of the genes for that assembly have gene symbols assigned. Protein names on WPs are better defined than gene symbols.