Bioinformatics: Converting EntrezGeneID to FASTA entry for BLAST alignment
1
0
Entering edit mode
4.6 years ago
tom5 • 0

I am trying to set up protein BLAST alignment. I have a list of genes formatted as [EntrezGeneID_GeneSymbol], such as 395552_CCL1. I'm wondering if there's a way to convert these genes to FASTA entries so I can run protein BLAST+. Please let me know if you can help. I have a large number of genes so cannot do this process manually through the NCBI's online BLAST interface.

RNA-Seq fasta BLAST R • 1.9k views
ADD COMMENT
1
Entering edit mode
4.6 years ago
vkkodali_ncbi ★ 3.8k

You can use Entrez Direct for this. First you need to get a list of Entrez GeneIDs as a list. If all entries in your genes list are formatted as EntrezGeneID_GeneSymbol, you can do something like cut -f1 -d'_' <input_file> to get a list of Entrez GeneIDs. Then you can use Entrez Direct on the Unix command line as follows:

epost -db gene -input <geneid_file> \
  | elink -db gene -target protein -name gene_protein_refseq \
  | efetch -format fasta \
  > output_file.fasta

Another approach is to use Batch Entrez. You still would have to generate a list of Entrez GeneIDs. Then, load that list to Batch Entrez -- make sure you select 'Gene' as the database -- and in the resulting page, use the 'Find related data' widget to choose Protein as shown in the following screenshot:

find_related_data

Once you click 'Find Items', you will be directed to a page with all of the proteins. Click on the 'Send To' button to download data in FASTA format:

download_data_fasta

If you would rather just use these for BLAST analysis directly, you can do so from the Send To menu by choosing 'Analysis Tool' > 'BLAST' as shown below:

send_to_blast

ADD COMMENT
0
Entering edit mode

Thank you! The entrez direct command worked like a charm. I also tried Batch Entrez but I think my file was too large, since the site did not end up loading the second page.

ADD REPLY
0
Entering edit mode

Hi, i am wondering if there is a way to edit this command so only the first Isoform is returned? Thank you!

ADD REPLY
0
Entering edit mode

If you want only one isoform for every protein coding gene, I recommend using RefSeq Select to pick one isoform per gene. As of now, the scope of RefSeq Select is limited to human and mouse only. If you are interested in other organisms, you will have to come up with your own set of specifications for picking one isoform per gene. Perhaps I may be able to help you better if you can specify exactly what you are interested in. For example, you can use Entrez Direct to generate a 2-column table with protein accession and size:

elink -db gene -id 5768 -target protein -name gene_protein_refseq \
  | esummary \
  | xtract -pattern DocumentSummary -element AccessionVersion,Slen

and then add gene_id to that table to pick the longest isoform for each protein. Depending on how big your starting set of gene IDs is and the taxonomic scope, you will have to tailor your approach.

ADD REPLY
0
Entering edit mode

Thank you for the reply. My overarching goal is to use BLAST as an intermediate step to find similar genes between organisms, some of which would not be matched by just comparing the gene names alone to differing annotation between species. I am therefore blasting chicken genes against the mouse genome. My goal is to return only one alignment per gene, ignoring isoforms. My set of gene IDs is around 850 genes.

Additionally, due to the large number of FASTA entries in my output, my terminal times out when running BLAST. Is there a way to modify the code so it returns multiple FASTA files (one for each gene)? This way, I can pass these to BLAST via a loop and store output even if the program terminates early. Alternatively, if you have a better way to run large bLAST alignment queries locally please feel free to share.

ADD REPLY
0
Entering edit mode

For mouse, you are better off just using RefSeq Select. You can use the query Mus musculus[Organism] AND RefSeq_Select[Filter] in the NCBI Protein portal for that; see here.

For the 850 chicken genes, you can download the FASTA file for the largest protein for each gene in a separate FASTA file using Entrez Direct as follows:

 for gid in `cat gene_id_list.txt` ; do 
    elink -db gene -id ${gid} -target protein -name gene_protein_refseq \
    | esummary \
    | xtract -pattern DocumentSummary -element AccessionVersion,Slen \
    | sort -k2,2nr | head -n1 | cut -f1 \
    | epost -db protein -format acc \
    | efetch -format fasta > ${gid}.fa ; 
done
ADD REPLY
0
Entering edit mode

Hi, Unfortunately the above script isn't working for me for my gene list. It returns errors such as: WebEnv value not found in link output - WebEnv1 Db value not found in summary input Failure of post to find data to load Db value not found in fetch input

Here are a few entries from my input file of gene IDs: 396320 395771 396128 408047 417179 395970 422145 396526.

Will this work?

ADD REPLY
1
Entering edit mode

I am not sure why it isn't working for you. In am able to get the FASTA sequences for the genes you list. The gene_id_list.txt file should have one gene id per line. It won't work if you have them as space-delimited list.

$ cat gene_id_list.txt 
396320
395771
396128
408047
417179
395970
422145
396526

for gid in `cat gene_id_list.txt` ; do 
    elink -db gene -id ${gid} -target protein -name gene_protein_refseq \
    | esummary \
    | xtract -pattern DocumentSummary -element AccessionVersion,Slen \
    | sort -k2,2nr | head -n1 | cut -f1 \
    | epost -db protein -format acc \
    | efetch -format fasta > ${gid}.fa ; 
done

$ grep '^>' *.fa
395771.fa:>NP_990262.1 gap junction beta-6 protein [Gallus gallus]
395970.fa:>NP_990417.1 ferritin heavy chain [Gallus gallus]
396128.fa:>XP_015134573.1 cysteine and glycine-rich protein 2 isoform X2 [Gallus gallus]
396320.fa:>NP_990694.1 beta-tectorin precursor [Gallus gallus]
396526.fa:>NP_990849.1 actin, cytoplasmic 1 [Gallus gallus]
408047.fa:>NP_001001315.2 thymosin beta-4 [Gallus gallus]
417179.fa:>XP_415461.1 allograft inflammatory factor 1-like [Gallus gallus]
422145.fa:>NP_001012589.1 integral membrane protein 2A [Gallus gallus]
ADD REPLY
0
Entering edit mode

I am wondering if there's a way to get FASTA sequences (just like in the command above) but using gene symbols rather than gene IDs?

ADD REPLY
0
Entering edit mode

If you know which species the gene symbols are from and that they are the official symbols, you can use esearch with a query like this: "Gallus gallus"[Organism] AND GJB6[Gene Name]. Then you pipe the esearch results to elink just as above.

esearch -db gene -query '"Gallus gallus"[organism] AND GJB6[gene name]' \
  | elink -target protein -name gene_protein_refseq \
  | efetch -format acc 
NP_990262.1
XP_025000385.1
XP_015133186.1
XP_015133180.1
XP_015133175.1
ADD REPLY
0
Entering edit mode

I tried this command in the command line and it works well. Is there a way to call this BASH script from R, to make it simpler for another researcher to use?

ADD REPLY
0
Entering edit mode

There's REntrez package that may do the trick. I haven't used it myself as much so I cannot say how well it works.

ADD REPLY

Login before adding your answer.

Traffic: 1658 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6