How to add biological annotations using R
4
1
Entering edit mode
5.7 years ago
neranjan ▴ 70

Hi All,

Is there a R package that I can use to get the biological annotations?

I am using NOISeq package to analyze RNASeq data, and I am interested in adding additional biological annotations such as:

  • feature length
  • GC content
  • biological classification features
  • chromosome position

Question: Is it possible to direct me to how to get these information ? Can you let me know which packages that I can use to get the information.

Thank you very much.

R biological annotations RNA-Seq • 2.8k views
ADD COMMENT
1
Entering edit mode

Thank you I was able to use biomaRt package to grab the required sequences using:

g_ids=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
human <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position","percentage_gene_gc_content"), filters="ensembl_gene_id", values=g_ids, mart=human)
gene_coords$length=gene_coords$end_position - gene_coords$start_position

Which gave me:

  ensembl_gene_id start_position end_position percentage_gene_gc_content length
1 ENSG00000177757         817371       819837                      48.52   2466
2 ENSG00000187634         923928       944581                      66.58  20653
3 ENSG00000188976         944204       959309                      59.55  15105

But the gene start and end positions as well as length of gene is well away from what it should be as shown below: (where i am following the NOISeq tutorial)

                Length    GC        Biotype Chromosome GeneStart GeneEnd
ENSG00000177757 2464.0 48.58        lincRNA          1    742614  745077
ENSG00000187634 4985.0 65.99 protein_coding          1    850393  869824
ENSG00000188976 3870.5 59.55 protein_coding          1    869459  884494
ADD REPLY
0
Entering edit mode

You are using GRCh38, NOISeq tutorial GRCh37?

ADD REPLY
0
Entering edit mode

it is possible, but in NOISeq it does not say which it is been using.

But again, the difference is huge, so it is obvious that I am missing something in calculating the gene-length.

I also tried to calculate the mean, median and sum of transcript length, and which was way off than the number they provided in NOISeq

  transcript_length.median transcript_length.mean transcript_length.sum
1                 1947.000               1947.000              1947.000
2                 2159.000               1699.353             28889.000
3                 1244.500               1749.000             10494.000
ADD REPLY
1
Entering edit mode

I think your tutorial is very old. Add "version=75" to your useMart command, and you'll see gene coordinates close to the tutorials.

ADD REPLY
1
Entering edit mode

As a side note, you calculate the total length of the gene (exons and introns). I guess NOISeq is using only sum of exon length.

ADD REPLY
0
Entering edit mode

Thanks @swbarnes2 and @b.nota

I was able to replicate the NOISeq data length using a older version of mart, and also using transcript start and end locations to calculate the length.

So my final code would be:

en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

g_ids_t=c("ENSG00000177757", "ENSG00000187634", "ENSG00000188976")
gene_coords=getBM(attributes=c("ensembl_gene_id", "start_position","end_position", "transcript_start", "transcript_end"), filters="ensembl_gene_id", values=g_ids_t, mart=human_2)
gene_coords$t_length = abs(gene_coords$transcript_start - gene_coords$transcript_end)

t_median_2 <- aggregate( t_length ~ ensembl_gene_id, gene_coords, plyr::each(median))

which gave me:

  ensembl_gene_id t_length
1 ENSG00000177757   2463.0
2 ENSG00000187634   4984.0
3 ENSG00000188976   3869.5

So I can take this as an answer.

As a side note: Which is off-set by 1 in each case, to the NOISeq data set: ( why ?)

                Length    GC        
ENSG00000177757 2464.0 
ENSG00000187634 4985.0 
ENSG00000188976 3870.5
ADD REPLY
2
Entering edit mode

The off-set of one is probably because of the one-based coordinate system used by ensembl. Add + 1 to your equation.

ADD REPLY
0
Entering edit mode

What are the three most common sources of programming errors? Poor naming, and 1-off errors.

ADD REPLY
2
Entering edit mode
5.7 years ago
Benn 8.3k

I think biomaRt can get you far, please read the manual.

ADD COMMENT
1
Entering edit mode
5.7 years ago
Ashastry ▴ 60

BiomaRt is my first option. Although, I occasionally use AnnotationDBI package.

ADD COMMENT
0
Entering edit mode
5.7 years ago
neranjan ▴ 70
en_2=useMart("ensembl", host = "http://feb2014.archive.ensembl.org")
human_2 = useDataset("hsapiens_gene_ensembl",mart=en_2)

So as a last question to close this , is there a way to save this "mart" object for later use. (or as a reference for later analysis) ?

ADD COMMENT
0
Entering edit mode

mart<-useMart(biomart ="ensemble", dataset = "")

ADD REPLY
0
Entering edit mode

not the assignment operator !!!

What I meant is to save it as object for later use, so I can load it when I do need it. (since these versions change what they have with time)

So is there a way to save this object in my hard drive? (some sort of similar way as they do in TxDb objects)

samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(samplefile)
ADD REPLY
1
Entering edit mode

Save the R session:

save.image("Analysis.Rdata")

load it again another time:

load("Analysis.Rdata")
ADD REPLY
0
Entering edit mode

Thanks, I think i was looking at more of only saving the ensemble database, and I think it can be done using:

rat_ensembl = useDataset("rnorvegicus_gene_ensembl",mart=ensembl)
save(rat_ensembl, file = "rat_ensembl.RData")

Then loading using:

load("rat_ensembl.RData")
ADD REPLY

Login before adding your answer.

Traffic: 2020 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6