Question

How to add gene symbol to RNA-Seq data using R

7.0 years ago

williamsbrian5064 ▴ 540

Hi,

I am trying to add gene symbols to my RNA-Seq data. I used Salmon to quantify my reads, but instead of doing a DE analysis I want to try an alternative program and do a little machine learning. So I really just need to add the gene symbol next to each transcript ID, so that I can easily identify the gene instead of having to look up the transcript ID in Ensembl.

I know there are packages that can help with this, such as org.Cf.eg.db, but I can't figure out how to make the package work with transcript IDs. I am inexperienced with the package, so that is most likely my issue. Here is an example of what my data looks like after quantification.

> head(test)
                  Name Length EffectiveLength      TPM   NumReads
1 ENSCAFT00000034820.1    957         736.829 1309.272  43423.000
2 ENSCAFT00000034824.1   1044         823.630 1001.516  37129.000
3 ENSCAFT00000034830.1   1545        1324.630 3796.436 226357.000
4 ENSCAFT00000034833.1    684         464.046 8086.686 168910.000
5 ENSCAFT00000034835.1    204          50.476 4033.303   9163.596
6 ENSCAFT00000034836.1    681         461.059 9748.035 202300.391

I really just want to add the gene symbol so the data looks something like this

> head(test)
                  Name Length EffectiveLength      TPM   NumReads Symbol
1 ENSCAFT00000034820.1    957         736.829 1309.272  43423.000 ABC
2 ENSCAFT00000034824.1   1044         823.630 1001.516  37129.000 ABD
3 ENSCAFT00000034830.1   1545        1324.630 3796.436 226357.000 ABE
4 ENSCAFT00000034833.1    684         464.046 8086.686 168910.000 ABF
5 ENSCAFT00000034835.1    204          50.476 4033.303   9163.596 ABG
6 ENSCAFT00000034836.1    681         461.059 9748.035 202300.391 ABH

R rna-seq • 7.0k views

score 17 · Accepted Answer · 2018-09-10

You can use the R package "biomaRt" to map your transcript IDs to gene names. See if the code below works:

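library("biomaRt")
mart <- useMart("ensembl")
# List all Ensembl datasets (one per organism)
listDatasets(mart)
# Choose the dataset of interest; in this case it's "cfamiliaris_gene_ensembl" I guess
# (confirm against the listDatasets() output, since dataset names can change between Ensembl releases)
ensembl <- useMart("ensembl", dataset = "cfamiliaris_gene_ensembl")
# List the attributes you can retrieve
listAttributes(ensembl)
# The Salmon IDs carry a version suffix (".1") that ensembl_transcript_id lacks, so strip it first
tx <- sub("\\.[0-9]+$", "", test$Name)
# Without 'filters', 'values' is not applied and the whole table is downloaded
gene <- getBM(attributes = c("ensembl_transcript_id", "external_gene_name"),
              filters = "ensembl_transcript_id", values = tx, mart = ensembl)
# Match your transcript IDs against ensembl_transcript_id
id <- match(tx, gene$ensembl_transcript_id)
# Add the gene symbol column to your data frame
test$Symbol <- gene$external_gene_name[id]
head(test)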
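
To check the matching step without querying Ensembl, here is a minimal offline sketch in base R. It assumes the transcript IDs in test$Name end in a ".<version>" suffix, as in the question, and it uses a small hand-made lookup table in place of a real getBM() result. The symbols ABC and ABD are the placeholders from the question, not real annotations.

```r
# Toy stand-in for the Salmon output; only the Name column matters here.
test <- data.frame(
  Name = c("ENSCAFT00000034820.1", "ENSCAFT00000034824.1", "ENSCAFT00000034830.1"),
  TPM  = c(1309.272, 1001.516, 3796.436),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table shaped like a getBM() result (placeholder symbols).
# Rows are deliberately in a different order, and one transcript is missing.
gene <- data.frame(
  ensembl_transcript_id = c("ENSCAFT00000034824", "ENSCAFT00000034820"),
  external_gene_name    = c("ABD", "ABC"),
  stringsAsFactors = FALSE
)

# Drop the trailing ".<version>" so both ID columns are comparable.
tx <- sub("\\.[0-9]+$", "", test$Name)

# match() gives, for each transcript, its row in the lookup table (NA if absent).
id <- match(tx, gene$ensembl_transcript_id)
test$Symbol <- gene$external_gene_name[id]

print(test)
```

Transcripts with no entry in the lookup table end up with NA in Symbol, so sum(is.na(test$Symbol)) is a quick way to see how many IDs were not annotated.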