I have around 10000 fish EST sequences in a fasta file and want to have an Entrez gene ID for as many as possible of these sequences. The reason I want Entrez gene IDs is to facilitate gene ontology searches and analyses.
My usual approach has been to BLAST against the Swiss-Prot and nr databases, retrieve the identifiers, and convert them into Entrez Gene IDs. However, using different conversion tools (DAVID, UniProt ID mapping...), I typically recover Entrez Gene IDs for only a small percentage of the hits.
How can I go efficiently from the EST sequences to Entrez Gene IDs?
My goal is to automate the process and get the maximum possible number of gene IDs for my gene ontology analyses. Alternatively, if you know of a way to get another, equally useful gene identifier that integrates well with gene ontology tools, I am also interested.
Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want. An example BLASTX output for a 180-bp query (where I altered two nucleotides) is below. Note the "GENE ID" field. This has the following hyperlink: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search&term=80303&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1 and from this one could parse the EntrezGene ID at the "term=" part.
ref|NP_079478.1| EF-hand domain-containing protein D1 isoform 1 [Homo sapiens]
Length=239
GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]
(10 or fewer PubMed links)
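As a rough sketch of the parsing step described above, the Python snippet below pulls the Entrez Gene ID out of either the hyperlink (the "term=" query parameter) or a "GENE ID:" line of the text report. The function names are hypothetical, and it assumes the link and the line look like the example shown here; other BLAST versions or output formats may differ.

```python
import re
from urllib.parse import urlparse, parse_qs

def gene_id_from_link(url):
    """Return the value of the 'term' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("term")
    return values[0] if values else None

# Matches lines such as "GENE ID: 80303 EFHD1 | EF-hand domain family, ..."
GENE_ID_LINE = re.compile(r"^\s*GENE ID:\s*(\d+)\s+(\S+)")

def gene_id_from_report_line(line):
    """Return (gene_id, symbol) from a 'GENE ID:' line, or None if no match."""
    match = GENE_ID_LINE.match(line)
    return (match.group(1), match.group(2)) if match else None

# Link and report line copied from the example above.
link = ("http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=search"
        "&term=80303&RID=8VAWAEY2015&log$=geneexplicitprot&blast_rank=1")
report_line = "GENE ID: 80303 EFHD1 | EF-hand domain family, member D1 [Homo sapiens]"

print(gene_id_from_link(link))
print(gene_id_from_report_line(report_line))
```

Using a URL parser rather than splitting on "term=" by hand keeps the code working if the order of the query parameters changes.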
Hi @Casey. I am already using blast2go for gene ontology analyses. However, I am trying a new R package called WGCNA (http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA/), and I need gene IDs. They suggest Entrez Gene IDs, and I don't know how to get these from blast2go. Is it possible? It would integrate pretty well with my current pipeline if it were :) Cheers
Running BLAST on the NCBI website does return a hyperlink to the EntrezGene entry. Perhaps you can also see this when you run BLAST locally? That hyperlink could be parsed to pull out the ID you want.
Hi @Larry. I'll dig into my blastx documentation to see if I can get this link in my output. This may be the quickest way of doing it. Thanks!
@Larry. If you care to add your comment as an answer, I will credit you the answer. I just used another output format for blastx and, of course, there is all the info I need. Thank you!
Hello Eric,
Could you please help me? I am doing similar work to yours: I have more than 10000 peptide sequences, run a local blastp against the nr database, and plan to get the Entrez Gene IDs. However, using the option -outfmt '6 qseqid qgi sseqid pident evalue', I did not see the Entrez Gene ID in the output. The output looks like this:
Bv1_000310_ofuz XP_010669526.1 100.000 106 0 0 1 106 1 106 3.10e-69 215
Bv1_000320_hyix KMT20011.1 100.000 346 0 0 1 346 1 346 0.0 718
My command line is:
blastp -query beta_extracted_fpkm_peps.txt -db /home/clingyun/CHEN/database/peps_13species_06152017.txt -num_threads 4 -num_descriptions 1 -num_alignments 1 -outfmt '6 qseqid qgi sseqid pident evalue' -out /home/clingyun/CHEN/beta_rnaseq/blastp_12species/beta_extracted_fpkm_peps_blastp_13species
I can see the hyperlink that includes the 'gene id' on the web page for "XP_010669526.1", but I did not see it in the local blastp output.
"XP_010669526.1" is an NCBI Reference Sequence ID. I have tried to convert more than 100 of these IDs to Entrez Gene IDs using DAVID, but none of them returned a gene ID. Could you please help me figure out how I can get the Entrez Gene ID locally? Thanks
Best wishes.
Chen Lingyun
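One possible way to do this lookup locally, offered as a sketch rather than a tested pipeline: NCBI distributes a tab-delimited gene2accession table (under ftp.ncbi.nlm.nih.gov/gene/DATA/) that links protein accessions to Gene IDs. The Python below takes the subject IDs from the second column of tabular BLAST output and looks them up in such a table. The function names and the sample rows are invented for illustration, and the code assumes the table header contains columns named "GeneID" and "protein_accession.version"; please check the header of the file you download. Not every protein accession will necessarily have an entry.

```python
import csv
import io

def subject_ids(blast_handle):
    """Collect subject IDs (second column) from tabular BLAST output."""
    ids = set()
    for line in blast_handle:
        fields = line.split()
        if len(fields) >= 2 and not line.startswith("#"):
            ids.add(fields[1])
    return ids

def protein_to_gene(table_handle, wanted):
    """Map protein accession.version -> GeneID for accessions in `wanted`.

    Assumes a tab-delimited table whose header names the columns
    'GeneID' and 'protein_accession.version' (as in gene2accession).
    """
    reader = csv.reader(table_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#") for name in next(reader)]
    gene_col = header.index("GeneID")
    prot_col = header.index("protein_accession.version")
    mapping = {}
    for row in reader:
        if len(row) > max(gene_col, prot_col) and row[prot_col] in wanted:
            mapping[row[prot_col]] = row[gene_col]
    return mapping

# Invented sample data, for illustration only (not real accessions or Gene IDs).
blast_sample = io.StringIO(
    "query_1\tXP_000000001.1\t100.000\t106\t0\t0\t1\t106\t1\t106\t3.10e-69\t215\n"
    "query_2\tXP_000000002.1\t98.500\t346\t5\t0\t1\t346\t1\t346\t0.0\t718\n"
)
table_sample = io.StringIO(
    "#tax_id\tGeneID\tstatus\tRNA_nucleotide_accession.version\t"
    "RNA_nucleotide_gi\tprotein_accession.version\n"
    "9999\t111\tMODEL\tXM_000000001.1\t-\tXP_000000001.1\n"
    "9999\t222\tMODEL\tXM_000000003.1\t-\tXP_000000003.1\n"
)

wanted = subject_ids(blast_sample)
mapping = protein_to_gene(table_sample, wanted)
for accession in sorted(wanted):
    print(accession, mapping.get(accession, "no Gene ID found"))
```

For the real files, the in-memory samples would be replaced by open("your_blast_output.tsv") and gzip.open("gene2accession.gz", "rt"). The table is large, so the sketch streams it row by row and keeps only the accessions that actually occur in the BLAST hits.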