From Sequence To Gene Id
13.2 years ago

Hi,

I have around 10000 fish EST sequences in a fasta file and want to have an Entrez gene ID for as many as possible of these sequences. The reason I want Entrez gene IDs is to facilitate gene ontology searches and analyses.

The approach I have traditionally used is to BLAST against the Swiss-Prot and nr databases, retrieve the identifiers, and convert them into Entrez gene IDs. However, using different conversion tools (DAVID, UniProt ID mapping...), I typically recover Entrez gene IDs for only a small percentage of the sequences.

How could I go efficiently from the EST sequences to Entrez gene IDs?

My goal is to automate the process and get the maximum possible number of gene IDs for my gene ontology analyses. Alternatively, if you know of an approach that yields another, equally useful gene identifier that integrates well with gene ontology tools, I am also interested.

Thanks!

This is such a noob question :)

entrez identifiers blast conversion • 7.0k views

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want.


Hi @Larry. I'll dig into my blastx documentation to see if I can get this link in my output. This may be the quickest way of doing it. Thanks!


@Larry. If you care to add your comment as an answer, I will credit you the answer. I just used another output format for blastx and, of course, there is all the info I need. Thank you!


Hello Eric,

Could you please help me? I am doing similar work to yours: I have more than 10000 peptide sequences, run a local blastp against the nr database, and plan to get the Entrez gene IDs. However, using the option -outfmt '6 qseqid qgi sseqid pident evalue', I did not see the Entrez gene ID in the output. The output looks like this:

Bv1_000310_ofuz XP_010669526.1 100.000 106 0 0 1 106 1 106 3.10e-69 215
Bv1_000320_hyix KMT20011.1 100.000 346 0 0 1 346 1 346 0.0 718

My command line is: blastp -query beta_extracted_fpkm_peps.txt -db /home/clingyun/CHEN/database/peps_13species_06152017.txt -num_threads 4 -num_descriptions 1 -num_alignments 1 -outfmt '6 qseqid qgi sseqid pident evalue' -out /home/clingyun/CHEN/beta_rnaseq/blastp_12species/beta_extracted_fpkm_peps_blastp_13species

On the NCBI website, the page for "XP_010669526.1" shows a hyperlink that includes the gene ID. But I do not see it in the local blastp output.

"XP_010669526.1" is an NCBI Reference Sequence ID. I have tried to convert more than 100 of these IDs to Entrez gene IDs using DAVID, but none of them returned a gene ID. Could you please help me figure out how I can get the Entrez gene ID locally? Thanks

Best wishes.

Chen Lingyun
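
One local route for the conversion asked about here is a lookup table from protein accession to GeneID. The following is a minimal sketch, assuming a tab-delimited table laid out like NCBI's gene2refseq file, with a `#`-prefixed header that contains `GeneID` and `protein_accession.version` columns. The sample rows are a hypothetical excerpt; the NP_079478.1 / 80303 pairing is taken from the BLAST hit quoted elsewhere in this thread, and the function names are illustrative only.

```python
import csv
import io

# Hypothetical excerpt in a gene2refseq-like layout (assumption: tab-delimited,
# header line starting with '#'). Unused fields are filled with '-'.
SAMPLE = (
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\t"
    "RNA_nucleotide_gi\tprotein_accession.version\n"
    "9606\t80303\t-\t-\t-\tNP_079478.1\n"
)


def load_protein_to_gene(handle):
    """Build {protein accession.version: GeneID} from a gene2refseq-style table."""
    header = handle.readline().lstrip("#").strip().split("\t")
    gene_col = header.index("GeneID")
    prot_col = header.index("protein_accession.version")
    mapping = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(gene_col, prot_col):
            continue  # skip short or malformed rows
        protein = row[prot_col]
        if protein != "-":
            # Keep the first GeneID seen for each protein accession.
            mapping.setdefault(protein, row[gene_col])
    return mapping


def annotate(accessions, mapping):
    """Pair each BLAST subject accession with its GeneID, or None if unmapped."""
    return [(acc, mapping.get(acc)) for acc in accessions]


protein_to_gene = load_protein_to_gene(io.StringIO(SAMPLE))
hits = annotate(["NP_079478.1", "XP_010669526.1"], protein_to_gene)
for acc, gene_id in hits:
    print(acc, gene_id if gene_id else "no GeneID in table")
```

XP_010669526.1 comes back unmapped here only because the sample excerpt does not include it. With a real table the same lookup would apply to every subject accession in the tabular BLAST output.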

13.2 years ago

OK, at Eric's suggestion, here goes:

Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want. An example BLASTX output for a 180-bp query (where I altered two nucleotides) is below. Note the "GENE ID" field. This has the following hyperlink: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1 and from this one could parse the EntrezGene ID at the "term=" part.

ref|NP_079478.1| EF-hand domain-containing protein D1 isoform 1 [Homo sapiens]
Length=239

GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]
(10 or fewer PubMed links)

Score = 52.4 bits (124), Expect = 6e-10
Identities = 58/59 (98%), Positives = 58/59 (98%), Gaps = 0/59 (0%)
Frame = +2

Query  2    IKDLESMFKLYDVGRDGFIDlmelklmmeklGAPQTHLGLKSMIKEVDEDFDGKLSFRE  178
            IKDLESMFKLYD GRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE
Sbjct  92   IKDLESMFKLYDAGRDGFIDLMELKLMMEKLGAPQTHLGLKSMIKEVDEDFDGKLSFRE  150
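
The parsing step described above could be sketched as follows. This is only an illustration, assuming the hyperlink keeps the `term=<GeneID>` query parameter and the text report keeps the `GENE ID: <number> <symbol>` line shown in this answer. The helper names are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs


def gene_id_from_link(url):
    """Return the value of the 'term' query parameter if it is numeric, else None."""
    query = parse_qs(urlparse(url).query)
    terms = query.get("term", [])
    return terms[0] if terms and terms[0].isdigit() else None


# Matches lines such as "GENE ID: 80303 EFHD1 | ..." in a text BLAST report.
GENE_ID_LINE = re.compile(r"^\s*GENE ID:\s*(\d+)\s+(\S+)", re.MULTILINE)


def gene_ids_from_report(text):
    """Return a list of (GeneID, symbol) tuples found in a text BLAST report."""
    return GENE_ID_LINE.findall(text)


link = ("http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303"
        "&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1")
report = "GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]"

print(gene_id_from_link(link))       # 80303
print(gene_ids_from_report(report))  # [('80303', 'EFHD1')]
```

Using a URL parser rather than splitting on "term=" by hand keeps the extraction independent of the order of the query parameters.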


@Larry Thanks. I used blastx locally including the following option: -outfmt '6 qseqid qgi sseqid pident evalue' to get the information that I needed.
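
A tabular report with these five columns can be reduced to one best hit per query along the following lines. This is a rough sketch, not the poster's actual pipeline: the thread does not say how the gene ID was then derived from these columns. The query names and the second and third subject IDs are made up for illustration, and "best" is assumed here to mean lowest e-value.

```python
import csv
import io

# Column order matches: -outfmt '6 qseqid qgi sseqid pident evalue'
FIELDS = ["qseqid", "qgi", "sseqid", "pident", "evalue"]

# Made-up rows for illustration only (tab-separated, as outfmt 6 produces).
SAMPLE = (
    "est_0001\t0\tref|NP_079478.1|\t98.31\t6e-10\n"
    "est_0001\t0\thypothetical_subject_2\t75.00\t2e-04\n"
    "est_0002\t0\thypothetical_subject_3\t88.50\t1e-30\n"
)


def best_hits(handle):
    """Keep, for each query, the row with the lowest e-value."""
    best = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(FIELDS):
            continue  # skip lines that do not have the expected five columns
        record = dict(zip(FIELDS, row))
        record["pident"] = float(record["pident"])
        record["evalue"] = float(record["evalue"])
        current = best.get(record["qseqid"])
        if current is None or record["evalue"] < current["evalue"]:
            best[record["qseqid"]] = record
    return best


best = best_hits(io.StringIO(SAMPLE))
for query, hit in best.items():
    print(query, hit["sseqid"], hit["evalue"])
```

The `sseqid` values kept this way are the subject identifiers that would then be looked up in whatever accession-to-gene mapping is used.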

13.2 years ago

This sounds like a job for blast2go: http://www.blast2go.org/


Hi @Casey. I am already using blast2go for gene ontology analyses. However, I am trying a new R package called WGCNA http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/, and I need gene IDs. They suggest Entrez gene IDs, and I don't know how to get these from blast2go. Is it possible? This would integrate pretty well with my current pipeline if it did :) Cheers
