How to get just protein_coding genes using biomart in R
9.0 years ago
M K ▴ 660

Dear all,

I would like help getting just the protein_coding genes from a gene expression file using biomaRt. What I have is a file of expression values for all mouse (mm10) genes, labelled with Ensembl gene names, and I need to get rid of the non-coding genes and pseudogenes.

R sequencing rna-seq • 11k views
4.0 years ago
rodd ▴ 250

To obtain all protein coding genes from chr 1-22, in an object called genes in R, using biomaRt, for example:

library(biomaRt)
# Connect to the Ensembl human gene dataset
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = 'www.ensembl.org')
# One row per protein_coding transcript on chromosomes 1-22; values are matched to filters by position
genes <- biomaRt::getBM(attributes = c("external_gene_name", "chromosome_name","transcript_biotype"), filters = c("transcript_biotype","chromosome_name"),values = list("protein_coding",c(1:22)), mart = mart)

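To get protein-coding genes from every chromosome, just remove the chromosome_name filter and the 1:22 vector. Note that this example queries the human dataset; for mouse, the dataset name is "mmusculus_gene_ensembl".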


Thanks @rodd. This was a simple answer to my immediate question and it was very helpful! I wasn't very familiar with biomaRt but I definitely should be.

9.0 years ago
cyril-cros ▴ 950

You can go to Ensembl BioMart and select the following attributes in the Gene section: Gene type, Transcript type. "protein_coding" is the value you want. Export the results, then run something like grep "protein_coding" biomartResults.txt and you should be set.
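For anyone who prefers to stay in R, here is a minimal sketch of the same filtering step. It assumes the export is tab-separated with column headers like "Gene stable ID" and "Gene type" (check your own file, the headers may differ), and it uses made-up rows in place of a real biomartResults.txt.

```r
# Toy stand-in for a BioMart TSV export. IDs and names are made up; a real
# file would be read with read.delim("biomartResults.txt") instead.
export_lines <- c(
  "Gene stable ID\tGene name\tGene type",
  "GENE_A\tgeneA\tprotein_coding",
  "GENE_B\tgeneB\tlncRNA",
  "GENE_C\tgeneC\tprotein_coding",
  "GENE_D\tgeneD\tprocessed_pseudogene"
)

# read.delim() turns the spaces in the headers into dots (Gene.type etc.)
biomart <- read.delim(text = export_lines, stringsAsFactors = FALSE)

# Exact match on the gene type column only
pc <- biomart[biomart$Gene.type == "protein_coding", ]
print(pc)
```

Unlike a plain grep over whole lines, this only looks at the gene type column, so a row is not kept just because "protein_coding" appears in some other column such as Transcript type.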


I already have my file with all genes on it, and I want to use R to get the protein_coding genes only from my file.


You can really use anything for that. If you want to do it in R:

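library(biomaRt)
setwd("~/Downloads/")
# Set shouldImport to FALSE to force a fresh query even if the save file exists
shouldImport=TRUE
saveFile="proteinCodingMouseGenes.Rda"
if (!shouldImport || !file.exists(saveFile)){
  print("Querying Biomart for protein coding genes")
  ensembl=useMart("ensembl")
  ensemblMouse = useDataset("mmusculus_gene_ensembl",mart=ensembl)
  # The 'biotype' filter keeps only genes whose gene type is protein_coding
  mouseProteinCodingGenes = getBM(attributes=c("ensembl_gene_id","external_gene_name","description"), filters='biotype', values=c('protein_coding'), mart=ensemblMouse)
  # Cache the result so later runs can skip the slow query
  save(mouseProteinCodingGenes,file=saveFile)
} else {
  print("Loading genes from savefile")
  # load() restores the data frame under its original name, mouseProteinCodingGenes
  load(saveFile)
}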

The only essential part is the Ensembl query; the rest just saves the result to a file so it can be loaded again later (querying BioMart takes a bit of time). biomaRt is the R package you want: you specify which mart and dataset you are using, then call getBM to request the listed attributes for every entry whose filters match the terms in values. listAttributes() lists the attributes you can request, and listFilters() lists the filters you can apply.
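If you are unsure which filter or attribute names to pass to getBM, a small helper like the one below can narrow them down. This is only a sketch: find_biotype_fields is a hypothetical name, and it relies on listFilters() and listAttributes() each returning a data frame with a name column. It needs the biomaRt package and a network connection, so the actual call is left as a comment.

```r
# Hypothetical helper: list the filters and attributes of a mart whose
# names mention "biotype".
find_biotype_fields <- function(mart) {
  filters <- biomaRt::listFilters(mart)        # columns include name, description
  attributes <- biomaRt::listAttributes(mart)  # columns include name, description
  list(
    filters = filters[grepl("biotype", filters$name), ],
    attributes = attributes[grepl("biotype", attributes$name), ]
  )
}

# Usage (not run here, requires network access to Ensembl):
# mart <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# find_biotype_fields(mart)
```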

The rest is just dataframe manipulation.
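As a rough illustration of that data frame step, the sketch below keeps only the rows of an expression table whose gene ID appears in the protein-coding list. Both tables are toy stand-ins with made-up IDs, and the column name gene_id is an assumption about how your expression file is laid out.

```r
# Toy stand-in for the getBM() result; IDs and names are made up.
mouseProteinCodingGenes <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_C"),
  external_gene_name = c("geneA", "geneC"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the expression file, normally read with read.delim().
# Note that GENE_C carries a version suffix (".2").
expr <- data.frame(
  gene_id = c("GENE_A", "GENE_B", "GENE_C.2", "GENE_D"),
  sample1 = c(10.5, 0.0, 3.2, 7.1),
  sample2 = c(12.1, 0.4, 2.9, 6.8),
  stringsAsFactors = FALSE
)

# Strip a trailing version suffix, if any, so IDs match the unversioned
# ensembl_gene_id values.
ids <- sub("\\.[0-9]+$", "", expr$gene_id)

# Keep only rows whose ID is in the protein-coding list.
expr_pc <- expr[ids %in% mouseProteinCodingGenes$ensembl_gene_id, ]
print(expr_pc)
```

If your file is keyed by gene symbols rather than Ensembl IDs, match against the external_gene_name column instead.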


I ran your R code and it worked, but I had to replace ensembl=useMart("ensembl") with

myMart <- useMart("ENSEMBL_MART_ENSEMBL",dataset="mmusculus_gene_ensembl", host="www.ensembl.org")

because of some changes to the Ensembl proxy. The issue I have is that I couldn't read the .Rda file. Is there any way to save this file as text? What I need to do is use the merge function in R to merge it with my own file, so that only the protein_coding genes in mine are kept.
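One possible way to do this, sketched with toy data: write the gene table out with write.table, read it back with read.delim, and merge it with the expression table. An .Rda file is restored with load(), as in the code above, not with read.table. The table contents, the gene_id column name, and the temporary output file are all assumptions made for illustration.

```r
# Toy stand-in for the getBM() result; IDs, names and descriptions are made up.
mouseProteinCodingGenes <- data.frame(
  ensembl_gene_id = c("GENE_A", "GENE_C"),
  external_gene_name = c("geneA", "geneC"),
  description = c("toy description A", "toy description C"),
  stringsAsFactors = FALSE
)

# Save as tab-separated text (a temporary file here; use your own path).
out_file <- tempfile(fileext = ".txt")
write.table(mouseProteinCodingGenes, file = out_file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the text file back in.
pc <- read.delim(out_file, stringsAsFactors = FALSE)

# Toy stand-in for the expression file.
expr <- data.frame(
  gene_id = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  sample1 = c(10.5, 0.0, 3.2, 7.1),
  stringsAsFactors = FALSE
)

# merge() does an inner join by default, so genes that are missing from the
# protein-coding table are dropped.
merged <- merge(expr, pc, by.x = "gene_id", by.y = "ensembl_gene_id")
print(merged)
```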


Thanks. Worked perfectly!
