Hi everyone
I am working on a gene expression data set from TCGA, where genes are annotated with Ensembl IDs. I used biomaRt to convert them to Entrez IDs with:
library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
# genesens holds the versioned Ensembl IDs, hence the *_version filter
genes <- getBM(
  filters = "ensembl_gene_id_version",
  attributes = c("ensembl_gene_id", "entrezgene"),
  values = genesens,
  mart = mart)
But all I get is a list with the mapped IDs, while I want to add a column with Entrez to the corresponding Ensembl ID. Any ideas of how I should modify the above code?
Thank you in advance!
EDIT: Note that the Ensembl IDs in the initial data frame have a dot (version) suffix.
It sounds like you need to perform a merge with genes and your expression matrix. If the TCGA data does not have a version number then you can remove it with gsub("\\.\\d+","", genes$ensembl_gene_id)
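The merge suggested above could look roughly like the following. This is a minimal base R sketch, not the answerer's exact code: the data frame expr, its column names, the counts, and the entrezgene values are all made-up placeholders standing in for the real expression matrix and the real getBM() result.

```r
# Toy expression data; IDs are the examples from this thread, counts are made up.
expr <- data.frame(
  ensembl_versioned = c("ENSG00000000003.13", "ENSG00000000005.5",
                        "ENSG00000000419.11", "ENSG00000000457.12"),
  sample_a = c(10L, 0L, 25L, 7L),
  stringsAsFactors = FALSE
)

# Stand-in for the getBM() result; entrezgene values are placeholders,
# not real Entrez IDs. One gene is left out on purpose.
genes <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  entrezgene = c(1001L, 1002L, 1003L),
  stringsAsFactors = FALSE
)

# Drop the trailing ".<version>" so both tables share an unversioned key.
expr$ensembl_gene_id <- sub("\\.\\d+$", "", expr$ensembl_versioned)

# all.x = TRUE keeps every expression row; genes without a mapping get NA.
annotated <- merge(expr, genes, by = "ensembl_gene_id", all.x = TRUE)
print(annotated)
```

Keeping the versioned ID in its own column means the original TCGA identifier is still available after the merge.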
rina : You should take a look at @Mike Smith's answer here: A: Mapping Ensembl Gene IDs with dot suffix
Some example data from the genesens object would help, @rina.
You are right.
Here are some: ENSG00000000003.13 ENSG00000000005.5 ENSG00000000419.11 ENSG00000000457.12
With the example IDs and the OP's code, the following is the result:
The output Ensembl gene IDs have no suffix. If you would like to merge the two data frames (the data and the results), you can merge them by ensembl_gene_id. If you could post a few lines from the data frame and the results (with a few matching rows), that would be helpful.
If you want to add the gene symbol at the end, add 'hgnc_symbol' to the attributes list.
The data frame's first column has Ensembl IDs such as the following. The rest of the columns are raw counts of expression data.
The results I get after the mapping look like this.
Well, there are ways to join the data frames, either using fuzzy logic or with some hacks. With some hacks (the easy way), it would be as follows (note: genes is the list of example Ensembl genes posted above and genesens is the result from biomaRt):
With fuzzy logic, it would be:
Thank you so much for your help! The entrezgene column is an integer, and left join can only be used on characters. Should I just convert it with the toString function? Excuse my very basic question, but I am just starting to work with R.
Can you print the data structure of common columns between the two frames?
Expression matrix columns
that I turned into
by using
nth(tstrsplit(genes, split ="\\."),n=1)
Biomart result is the following matrix
Every column is "character" except entrezgene, which is an integer.
Then your merge is on the ensembl_gene_id column (from the results) and the X1 column from the data matrix. The type of the entrezgene column doesn't affect left_join.
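To illustrate the point about key columns, here is a small sketch using base R merge() as a stand-in for a left join. The data frames dat and results and all their values are made up; only the column names X1, ensembl_gene_id and entrezgene come from the thread. The dplyr equivalent would be left_join(dat, results, by = c("X1" = "ensembl_gene_id")).

```r
# Made-up stand-ins for the expression data and the biomaRt results.
dat <- data.frame(
  X1 = c("ENSG00000000003", "ENSG00000000005"),
  sample_a = c(10L, 0L),
  stringsAsFactors = FALSE
)
results <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
  entrezgene = c(1001L, 1002L),  # placeholder integers, not real Entrez IDs
  stringsAsFactors = FALSE
)

# Only the key columns (X1 and ensembl_gene_id) take part in the matching;
# both are character here. all.x = TRUE gives left-join behaviour.
joined <- merge(dat, results, by.x = "X1", by.y = "ensembl_gene_id", all.x = TRUE)

# entrezgene is carried along unchanged and stays an integer column.
str(joined)
```

Because entrezgene is not a join key, there is no need to convert it to character before joining.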
This is the reason I am confused when I get this message
And as the entrezgene column is the only one not being character I assumed this was the problem.
Input head:
results head:
data structure of results (ncbi entries are integers)
output:
Check whether any of the loaded packages conflict with dplyr, and also check the structure of the common columns. For example, compare
str(dat$X1)
and
str(results$ensembl_gene_id)
from the above example. Both must match.
I was mistakenly passing the Ensembl ID column instead of the whole data frame as an argument to the left join function. It works just fine now. Thanks for helping!
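Both the type check and the mistake described above (passing a single column where a data frame is expected) can be caught before joining. The sketch below is one way to do that; keys_compatible is a hypothetical helper written for this illustration, not a dplyr function, and the data are made up.

```r
# Made-up stand-ins for the expression data and the biomaRt results.
dat <- data.frame(X1 = c("ENSG00000000003", "ENSG00000000005"),
                  sample_a = c(10L, 0L), stringsAsFactors = FALSE)
results <- data.frame(ensembl_gene_id = c("ENSG00000000003", "ENSG00000000005"),
                      entrezgene = c(1001L, 1002L), stringsAsFactors = FALSE)

# Hypothetical helper: stops unless both inputs are data frames, then
# reports whether the two key columns have the same class.
keys_compatible <- function(x, y, x_key, y_key) {
  if (!is.data.frame(x) || !is.data.frame(y)) {
    stop("both inputs must be whole data frames, not single columns")
  }
  identical(class(x[[x_key]]), class(y[[y_key]]))
}

# Whole data frames with character keys on both sides pass the check.
ok <- keys_compatible(dat, results, "X1", "ensembl_gene_id")

# Passing dat$X1 (a character vector) instead of dat is rejected;
# tryCatch captures the error message rather than halting the script.
bad <- tryCatch(keys_compatible(dat$X1, results, "X1", "ensembl_gene_id"),
                error = function(e) conditionMessage(e))
```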
rin : If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they all work.
Note: I have moved @cpad0112's original comment to an answer to maintain the train of thought.
Looking at the NAs that came up after mapping to Entrez, I randomly checked one (ENSG00000018607) and it is linked to an Entrez ID, yet that ID was not found by the mapping. Any ideas what might be the reason?
This is the code I used