Dear Biostars community,
I am analyzing RNA-seq counts from the GTEx website with edgeR. The goana() function requires Entrez and RefSeq ids.
The original count data only contains Ensembl ids, so I need to map them to Entrez and RefSeq. The problem is that one Ensembl id can map to multiple Entrez ids, and one Entrez id can map to multiple RefSeq ids. This makes it difficult to annotate "genes" in the "DGEList".
E.g., Ensembl "ENSG00000223972" maps to 4 Entrez ids: "84771" "727856" "100287102", and Entrez "84771" maps to two RefSeq ids: "NR_024004" "NR_024005".
So my question is: how should I handle mappings that are not one-to-one?
Alternatively, how can I perform GO enrichment analysis with given human Ensembl ids?
Thanks a lot.
Regards,
Jianhai
Hi,
If you have gene symbols (HGNC), use GeneSCF to get complete annotation for all your input genes. All ENSGs (Ensembl genes) will have a corresponding gene symbol (you can find it in the GTF or GFF3 from Ensembl). To avoid this problem, I personally prefer to use only Ensembl IDs and gene symbols throughout the analysis, and to keep the same annotation version throughout.
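The GTF-based lookup suggested above can be sketched in a few lines of Python. This is only an illustration, assuming the usual layout in which column 9 holds `key "value";` pairs including `gene_id` and `gene_name`. The function name and the example records are made up for the sketch, and the version suffix found in GTEx-style ids (e.g. `ENSG00000223972.5`) is stripped on the assumption that unversioned ids are wanted.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_symbols_from_gtf(lines):
    """Return {Ensembl gene id: gene symbol} from the 'gene' records of a GTF."""
    mapping = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep one record per gene, skip transcripts/exons
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        # Assumption: drop a trailing ".N" version so ids match across files.
        gene_id = gene_id.split(".")[0]
        # Fall back to the Ensembl id when a record carries no gene_name.
        mapping[gene_id] = attrs.get("gene_name", gene_id)
    return mapping


# Illustrative excerpt in GTF layout (not copied from a real file).
example = [
    "#!genome-build GRCh38\n",
    '1\thavana\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n',
    '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328"; gene_name "DDX11L1";\n',
]
symbols = gene_symbols_from_gtf(example)
print(symbols)
```

Because each gene record appears once in the annotation file, this kind of lookup gives one symbol per Ensembl id, provided the GTF matches the annotation version used to produce the counts.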
Where are you getting your mappings? ENSG00000223972 is only Entrez ID 100287102, the rest are different genes.
Hello Devon Ryan,
The mapping was obtained as follows:

    library(org.Hs.eg.db)
    x <- as.list(org.Hs.egENSEMBL2EG)
    x["ENSG00000223972"]
Thanks. Jianhai
That R package apparently has some errors, since the example mapping is incorrect. Please report this upstream to the package maintainer.
Maybe you can use the biomaRt R package. When I used org.Hs.eg.db to map Ensembl ids to gene symbols, I could get two gene symbols for one Ensembl id.
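Whatever resource the mapping comes from, it can help to separate the one-to-one mappings from the ambiguous ones before annotating anything. A minimal Python sketch, assuming the mapping has been exported as (source id, target id) pairs. The function name is hypothetical, the first three pairs reuse the ids quoted earlier in this thread, and the last pair is a made-up placeholder:

```python
from collections import defaultdict


def split_by_ambiguity(pairs):
    """Split source ids into those with exactly one target and those with several."""
    targets = defaultdict(set)
    for source_id, target_id in pairs:
        targets[source_id].add(target_id)
    unique = {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}
    ambiguous = {s: sorted(t) for s, t in targets.items() if len(t) > 1}
    return unique, ambiguous


pairs = [
    ("ENSG00000223972", "84771"),
    ("ENSG00000223972", "727856"),
    ("ENSG00000223972", "100287102"),
    ("ENSG_PLACEHOLDER", "0000"),  # made-up pair, for illustration only
]
unique, ambiguous = split_by_ambiguity(pairs)
print("one-to-one:", unique)
print("one-to-many:", ambiguous)
```

Listing the ambiguous ids explicitly makes it a deliberate choice whether to drop them, resolve them by hand, or report them upstream, rather than silently keeping whichever match comes first.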
That is why, to avoid confusion, I suggested using the GTF/GFF3 to convert your ENSGs to gene symbols.
Hello everyone,
Thanks for all your replies.
In the original data, I already have gene symbols alongside the Ensembl ids, but no Entrez or RefSeq ids. My fundamental goal is GO enrichment analysis, preferably with these Ensembl ids. Does anyone have ideas?
Thanks.
Regards, Jianhai
Your problem seems to stem from mixing two different gene sets. You have to understand that different resources have different notions of what a gene is. Ensembl provides one set of genes as part of its annotation of the human genome. RefSeq, on the other hand, is just a collection of sequences, some of them assigned to genes. RefSeqGene is a subset of RefSeq that "defines genomic sequences to be used as reference standards for well-characterized genes". While Ensembl has at least an operational definition of what a gene is (roughly, a locus producing a set of related, overlapping transcripts), I still haven't found anything explaining what a gene is in RefSeq. As already suggested by EagleEye, my advice for data analysis is to decide which genome reference you want to use for the project and stick to it.
Since you already have your Gene Symbols, you can use GeneSCF to perform enrichment analysis. I hope you noticed that in my last comment. If you have any difficulties in using GeneSCF, I am here to help you with it.
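For readers unfamiliar with what an enrichment analysis computes, over-representation of a GO term is commonly assessed with a hypergeometric (one-sided Fisher) test. The Python sketch below is a generic illustration of that test and is not taken from GeneSCF or goana. The function name, argument names, and the toy numbers are assumptions made for the example.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, overlap):
    """P(X >= overlap) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of every outcome at least as extreme as `overlap`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 annotated with the term,
# 3 genes in the list of interest, 2 of which carry the term.
p = hypergeom_pvalue(population=10, annotated=4, sample=3, overlap=2)
print(p)
```

The identifier type does not matter to the test itself, as long as the gene list, the background set, and the GO annotation all use the same ids (for example, Ensembl ids or gene symbols throughout). In practice the p-values would also need correction for testing many terms.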
Please use ADD COMMENT or ADD REPLY to respond to previous reactions, so that this thread remains logically structured and easy to follow. I have now moved your post, but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.