I have a vector (in R) of probes from an Affymetrix microarray. I would like to find the Ensembl ID, the gene name (HGNC symbol), the gene length and the GC content using the biomaRt library in R. To do this, I run:
library(biomaRt)
# Finding Ensembl IDs
data <- useMart(biomart="ensembl", dataset="hsapiens_gene_ensembl")
ensemblids <- getBM(attributes=c("ensembl_gene_id"), filters=c("affy_hg_u133a"), values=probes, mart=data)
# Finding gene name (hgnc), gene length and GC-content (getBM returns a data frame, so pass the ID column as the filter values)
dframe <- getBM(attributes=c("hgnc_symbol", "percentage_gc_content"), filters=c("ensembl_gene_id"), values=ensemblids$ensembl_gene_id, mart=data)
However, as you can see, I only obtain the gene name and the GC content, because I cannot find any attribute that gives the gene length. Do you know how to solve this?
One more thing: my vector contains 22,000 probes, but ensemblids contains only 16,000 Ensembl IDs. Why is that?
Neil is right. There isn't a 1:1 mapping between Affy probes and Ensembl IDs. Several probes can map to the same gene, particularly if that gene is quite large. Depending on your chip, some probes may not map to genes at all. Another source of confusion may be the way that we handle probes in our database. We don't use the Affy annotation stating which probe goes with which gene. Instead, we map the sequences of their probes to the genome and see where they overlap genes. This may also lead to us reporting different genes for a probe than Affy does. There's a help page that explains this here.
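To see how this plays out for a given probe list, it can help to request the probe ID together with the Ensembl ID and then count the cases. The following is only a sketch: the helper name `summarise_probe_mapping` and the toy identifiers are made up for illustration, and the commented query assumes the `affy_hg_u133a` attribute and the `probes` and `data` objects from the question.

```r
# Summarise a probe-to-gene table with columns affy_hg_u133a and
# ensembl_gene_id (the shape a query for both identifiers would return).
summarise_probe_mapping <- function(probes, mapping) {
  # Drop rows where no gene ID came back, and any duplicated pairs
  keep <- !is.na(mapping$ensembl_gene_id) & mapping$ensembl_gene_id != ""
  mapping <- unique(mapping[keep, ])
  genes_per_probe <- table(mapping$affy_hg_u133a)
  list(
    n_probes     = length(unique(probes)),
    n_unmapped   = sum(!unique(probes) %in% mapping$affy_hg_u133a),
    n_multi_gene = sum(genes_per_probe > 1),          # probes hitting >1 gene
    n_genes      = length(unique(mapping$ensembl_gene_id))
  )
}

# Toy input with made-up identifiers (not real probes or genes)
toy_probes <- c("p1", "p2", "p3", "p4")
toy_mapping <- data.frame(
  affy_hg_u133a   = c("p1", "p2", "p3", "p3"),
  ensembl_gene_id = c("geneA", "geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)
mapping_summary <- summarise_probe_mapping(toy_probes, toy_mapping)
print(mapping_summary)

# Against Ensembl, the table would come from asking for both identifiers
# (assumes the objects from the question exist):
# mapping <- getBM(attributes = c("affy_hg_u133a", "ensembl_gene_id"),
#                  filters = "affy_hg_u133a", values = probes, mart = data)
```

In the toy table, two probes share one gene, one probe hits two genes and one probe is unmapped, which are the three situations described above.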
2) The short, unsatisfying answer is that for various reasons, not every HGNC symbol maps directly to an Ensembl Gene ID. I'm sure Emily_Ensembl can tell you more about that.
Hi Neilfws, using cds_length (which is not actually what I am looking for: gene length != CDS length) I get an error:
Error in getBM(attributes = c("hgnc_symbol", "cds_length", "percentage_gc_content"), :
Query ERROR: caught BioMart::Exception::Usage: Attributes from multiple attribute pages are not allowed
There are different sections that you can get attributes from. To see how this is structured, have a look at the BioMart browser tool.
We don't actually have gene length as an attribute, but you can get the start and end coordinates, then just do some arithmetic. The start and end are in the same section as the other attributes you need, so you can get everything you need in a single query.
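The arithmetic could look something like the sketch below. The helper `add_gene_length` and the toy rows are made up for illustration; the commented query assumes the `start_position` and `end_position` attributes plus the `percentage_gc_content` name used in the question, so it is worth checking `listAttributes(data)` for the exact names in your Ensembl release.

```r
# Add a gene_length column to a data frame that has start_position and
# end_position columns.
add_gene_length <- function(genes) {
  # Coordinates are 1-based and inclusive, hence the + 1
  genes$gene_length <- genes$end_position - genes$start_position + 1
  genes
}

# Toy rows with made-up identifiers and coordinates
toy_genes <- data.frame(
  ensembl_gene_id = c("geneA", "geneB"),
  start_position  = c(1000, 5000),
  end_position    = c(1999, 5499),
  stringsAsFactors = FALSE
)
toy_genes <- add_gene_length(toy_genes)
print(toy_genes$gene_length)  # 1000 500

# With biomaRt, a single query might look like this (assumes the objects
# from the question exist and the attribute names are valid for your release):
# dframe <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
#                                "start_position", "end_position",
#                                "percentage_gc_content"),
#                 filters = "ensembl_gene_id",
#                 values = ensemblids$ensembl_gene_id, mart = data)
# dframe <- add_gene_length(dframe)
```

Note that a length computed this way is the genomic span of the gene, introns included, not the length of any one transcript.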
Google that error; it's quite common and means that you're trying to query tables that are not linked. You'll need to do 2 separate queries, then merge the results.
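A rough sketch of the two-query approach, for attributes that really do live on different pages (such as cds_length here). The helper `merge_query_results` and the toy rows are made up for illustration, and the commented queries assume that `ensembl_gene_id` can be requested alongside the attributes on both pages so that it can serve as the join key.

```r
# Join the results of two separate queries on the shared gene ID.
merge_query_results <- function(gene_info, extra_info) {
  # all.x = TRUE keeps genes that have no row in the second result
  merge(gene_info, extra_info, by = "ensembl_gene_id", all.x = TRUE)
}

# Toy results with made-up identifiers and values
toy_gene_info <- data.frame(
  ensembl_gene_id = c("geneA", "geneB"),
  hgnc_symbol     = c("symA", "symB"),
  stringsAsFactors = FALSE
)
toy_extra_info <- data.frame(
  ensembl_gene_id = "geneA",
  cds_length      = 300,
  stringsAsFactors = FALSE
)
merged <- merge_query_results(toy_gene_info, toy_extra_info)
print(merged)

# With biomaRt, the two inputs might come from (assumes the objects from
# the question exist):
# gene_info  <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
#                                    "percentage_gc_content"),
#                     filters = "ensembl_gene_id",
#                     values = ensemblids$ensembl_gene_id, mart = data)
# extra_info <- getBM(attributes = c("ensembl_gene_id", "cds_length"),
#                     filters = "ensembl_gene_id",
#                     values = ensemblids$ensembl_gene_id, mart = data)
```

If the second query returns several rows per gene (for example one per transcript), the merged table will contain several rows per gene as well, so some aggregation may be needed first.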
Thank you, Emily!