Question

From Sequence To Gene Id

3

13.8 years ago

Eric Normandeau 11k

Hi,

I have around 10000 fish EST sequences in a fasta file and want to have an Entrez gene ID for as many as possible of these sequences. The reason I want Entrez gene IDs is to facilitate gene ontology searches and analyses.

My usual approach is to BLAST the sequences against the Swiss-Prot and nr databases, retrieve the identifiers, and convert them into Entrez gene IDs. However, using different tools (DAVID, UniProt conversion...), I typically get gene IDs for only a small percentage of them.

How could I go efficiently from the EST sequences to Entrez gene IDs?

My goal is to automate the process and get as many gene IDs as possible for my gene ontology analyses. Alternatively, if you know of a way to get another, equally useful gene identifier that integrates well with gene ontology tools, I am also interested.

Thanks!

VGhpcyBpcyBzdWNoIGEgbm9vYiBxdWVzdGlvbiA6KQo=

entrez identifiers blast conversion • 7.7k views

ADD COMMENT • link updated 13.8 years ago by Larry_Parnell 16k • written 13.8 years ago by Eric Normandeau 11k

1

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want.

ADD REPLY • link 13.8 years ago by Larry_Parnell 16k

0

Hi @Larry. I'll dig into my blastx documentation to see if I can get this link in my output. This may be the quickest way of doing it. Thanks!

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

0

@Larry. If you care to add your comment as an answer, I will credit you the answer. I just used another output format for blastx and, of course, there is all the info I need. Thank you!

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k

0

Hello Eric,

Could you please help me? I am doing work similar to yours: I have more than 10,000 peptide sequences, I run a local blastp against the nr database, and I want to get the Entrez gene IDs. However, using the option -outfmt '6 qseqid qgi sseqid pident evalue', I do not see the Entrez gene ID in the output. The output looks like this:

Bv1_000310_ofuz XP_010669526.1 100.000 106 0 0 1 106 1 106 3.10e-69 215
Bv1_000320_hyix KMT20011.1 100.000 346 0 0 1 346 1 346 0.0 718

My command line is: blastp -query beta_extracted_fpkm_peps.txt -db /home/clingyun/CHEN/database/peps_13species_06152017.txt -num_threads 4 -num_descriptions 1 -num_alignments 1 -outfmt '6 qseqid qgi sseqid pident evalue' -out /home/clingyun/CHEN/beta_rnaseq/blastp_12species/beta_extracted_fpkm_peps_blastp_13species

On the NCBI web page for "XP_010669526.1" I can see the hyperlink that includes the gene ID, but I do not see it in the local blastp output.

"XP_010669526.1" is an NCBI Reference Sequence ID. I have tried to convert more than 100 of these IDs to Entrez gene IDs using DAVID, but none of them returned a gene ID. Could you please help me figure out how to get the Entrez gene IDs locally? Thanks.

Best wishes.

Chen Lingyun

ADD REPLY • link 8.1 years ago by clingyun ▴ 20
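One possible local route for the question above, offered as a sketch rather than a confirmed fix: NCBI's Gene FTP area distributes a tab-delimited gene2accession table that pairs each GeneID with a protein accession.version. The Python below assumes the header of that file names its columns `GeneID` and `protein_accession.version`. The function name `load_protein_to_gene` and its `wanted` parameter are hypothetical, and the demo rows are a tiny stand-in built from the NP_079478.1 / 80303 pair quoted elsewhere in this thread, not real file contents.

```python
import io


def load_protein_to_gene(handle, wanted=None):
    """Map protein accession.version -> GeneID from a gene2accession-style table.

    Assumes a tab-delimited file whose first line is a header starting with '#'.
    If `wanted` is given, only those accessions are kept (saves memory).
    """
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    gene_col = header.index("GeneID")
    prot_col = header.index("protein_accession.version")

    mapping = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(gene_col, prot_col):
            continue  # skip short or malformed rows
        accession = fields[prot_col]
        if accession == "-":
            continue  # row has no protein accession
        if wanted is not None and accession not in wanted:
            continue
        # Keep the first GeneID seen for each accession.
        mapping.setdefault(accession, fields[gene_col])
    return mapping


# Tiny in-memory stand-in for the real table (only the first seven columns).
demo = io.StringIO(
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\t"
    "RNA_nucleotide_gi\tprotein_accession.version\tprotein_gi\n"
    "9606\t80303\t-\t-\t-\tNP_079478.1\t-\n"
    "9606\t80303\t-\t-\t-\t-\t-\n"
)
mapping = load_protein_to_gene(demo, wanted={"NP_079478.1"})
print(mapping)
```

With the real table, the same function could be given `gzip.open("gene2accession.gz", "rt")` (the local path is an assumption) and a `wanted` set built from the subject IDs in the tabular BLAST output. Accessions that are not RefSeq records tied to a gene may simply be absent from the table.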

score 1 · Answer 1 · 2011-10-06

OK, at Eric's suggestion, here goes:

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want. An example BLASTX output for a 180-bp query (where I altered two nucleotides) is below. Note the "GENE ID" field. This has the following hyperlink: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1 and from this one could parse the EntrezGene ID at the "term=" part.

ref|NP_079478.1| EF-hand domain-containing protein D1 isoform 1 [Homo sapiens] Length=239

GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens] (10 or fewer PubMed links)

Score = 52.4 bits (124), Expect = 6e-10
Identities = 58/59 (98%), Positives = 58/59 (98%), Gaps = 0/59 (0%)
Frame = +2

Query  2    IKDLESMFKLYDVGRDGFIDlmelklmmeklGAPQTHLGLKSMIKEVDEDFDGKLSFRE  178
            IKDLESMFKLYD GRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE
Sbjct  92   IKDLESMFKLYDAGRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE  150
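The parsing idea described in this answer could be sketched in Python roughly as follows. This is a minimal illustration, not a tested pipeline: it assumes the hyperlink keeps the ID in its `term=` query parameter and that the text report contains lines starting with "GENE ID:", as in the example above. The function names `gene_id_from_url` and `gene_ids_from_report` are made up for this sketch.

```python
import re
from urllib.parse import urlparse, parse_qs


def gene_id_from_url(url):
    """Return the value of the 'term' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("term")
    return values[0] if values else None


# Matches the numeric ID that follows "GENE ID:" in a text BLAST report.
GENE_ID_LINE = re.compile(r"GENE ID:\s*(\d+)")


def gene_ids_from_report(text):
    """Return every Gene ID found in a text report, in order of appearance."""
    return GENE_ID_LINE.findall(text)


url = (
    "http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303"
    "&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1"
)
report = "GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]"

print(gene_id_from_url(url))
print(gene_ids_from_report(report))
```

Using the standard library's URL parser instead of splitting on "term=" by hand keeps the sketch working if the order of the query parameters changes.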

score 0 · Answer 2 · 2011-10-05

0

13.8 years ago

Casey Bergman 18k

This sounds like a job for blast2go: http://www.blast2go.org/

ADD COMMENT • link 13.8 years ago by Casey Bergman 18k

0

Hi @Casey. I am already using blast2go for gene ontology analyses. However, I am trying a new R package called WGCNA (http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/), and I need gene IDs. They suggest Entrez gene IDs, and I don't know how to get these from blast2go. Is it possible? It would integrate pretty well with my current pipeline if it is :) Cheers

ADD REPLY • link 13.8 years ago by Eric Normandeau 11k