Length of transcripts of mm9
2
0
Entering edit mode
8.5 years ago
qwzhang0601 ▴ 80

I have got raw count of mouse RNA reads. In order to calculate the FPKM, I obtained the length of transcripts (CDS + UTR) of the genes through BioMart. And then for each gene I selected the longest one. However, later I found the alignment was mapped to an old version of genome (mm9). But the information I got from BioMart is on a new version of genome (i.e., mm10). And I found from the the web site of Biomart, I can only achieve CDS length (but not transcript length) for a old ensemble corresponding to mm9. So I want to know (1) is there a relatively easy way to get the length of transcripts for mm9, given the gene ensemble ID?
(2) If there is no easy ways, is it good enough to use the transcript length from mm10 to calculated the FPKM? I think the gene length from different version of genome should be the same or quite similar, right?

Thanks

RNA-Seq Biomart FPKM • 6.0k views
ADD COMMENT
0
Entering edit mode

You can find mm9 genome in Ensembl release 67 here. BioMart from there should get you the information you need.

Was your data mapped to mm9 genome or transcriptome. Either would have different lengths and can't be used interchangeably.

ADD REPLY
0
Entering edit mode

Thank you. Yes, I can find the Ensemble release 67, but for this version of genome I can only get the Length of CDS not the transcripts.

ADD REPLY
0
Entering edit mode

Are you sure about that? I am seeing the usual options under "Attributes"/"Sequence".

ADD REPLY
0
Entering edit mode

Do you mean "Attributes"->"sequence"-> "cDNA sequences". And then calculate the length of the transcripts? I think it works. In the latest version of genome, I can directly get such information by choosing "Attributes"-> "Features"->"Transcript length (including UTRs and CDS)". But Thanks for your suggestions.

ADD REPLY
0
Entering edit mode

the raw count of reads were obtained by HTSeq. I think it must use the transcript annotation for the alignment. Thanks

ADD REPLY
0
Entering edit mode

Not many use FPKM any more. Are you looking to do DE analysis? If so you can use the raw counts you have with a package like DESeq2.

ADD REPLY
0
Entering edit mode

Yes. I will use DESeq2 to do find DE genes. However, I want to generate a heat map showing expression levels of genes under different conditions. I think the FPKM show relative expression level of genes. Besides, before I do the DE analysis I want to filter out unexpressed genes (i.e., FPKM<1) under both conditions. Thanks for your help.

ADD REPLY
0
Entering edit mode

Please do not send identical questions to BioStars and Ensembl helpdesk (twice).

ADD REPLY
0
Entering edit mode

I see. Sorry for that!

ADD REPLY
0
Entering edit mode
8.5 years ago
ATRX ★ 1.1k

You may use this code; however you need GTF mm10 as an input file.

## Transcript Length
library(GenomicRanges)
library(rtracklayer)

## reading your mm10 gtf file and read only exons
GTFfile <- "mm10,gtf" ## your mm10 gtf file
GTF <- import.gff(GTFfile, format="gtf", genome="mm10", asRangedData=F, feature.type="exon") 
grl <- reduce(split(GTF, elementMetadata(GTF)$gene_id))
reducedGTF <- unlist(grl, use.names=T)
elementMetadata(reducedGTF)$gene_id <- rep(names(grl), elementLengths(grl))
elementMetadata(reducedGTF)$widths <- width(reducedGTF)

## calculation of length
calc_length <- function(x) {
  sum(elementMetadata(x)$widths)
}
output <- t(sapply(split(reducedGTF, elementMetadata(reducedGTF)$gene_id), calc_length))
output <- data.frame(t(output))
output$id = rownames(output)
output$id = gsub(pattern = "(.*)[.](.*)", replacement = "\\1", x = output$id)
output = output[,c(2,1)]
colnames(output)[2] = "Transcript.Length"
head(output)
ADD COMMENT
0
Entering edit mode

Thank you. But would you please briefly explain the code? Does this code return length of each transcript of a gene? Or it returns a length for each gene (the longest or the union of transcripts)?

Thanks

ADD REPLY
0
Entering edit mode

This will give you the length of the transcript using GTF/GFF file. Here is what the code does:

  1. import.gff reads the file using the exon regions and creates IRange object
  2. reduce is used for combining all the exons based on a particular gene id
  3. calculate the length of the entire exon region using the width that is obtained using reduce function

Try running the code and check the results and you will see what I am doing.

ADD REPLY
0
Entering edit mode

I downloaded the mmg.gtf from "ftp://ftp.ensembl.org/pub/release-67/gtf/mus_musculus/". But when I run the following code, I got an error. And there are warnings? Do you have any suggestions? Thanks.

GTFfile <- "mm9.gtf" 

GTF <- import.gff(GTFfile, format="gtf", genome="mm9", asRangedData=F, feature.type="exon")

Error in validObject(.Object) : 
  invalid class “GRanges” object: 'seqnames' contains missing values
In addition: There were 16 warnings (use warnings() to see them)
> warnings() 
Warning messages:
1: 'BiocGenerics:::updateS4' is deprecated.
Use 'replaceSlots' instead.
See help("Deprecated")
2: 'BiocGenerics:::updateS4' is deprecated.
3: ......
ADD REPLY
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 2529 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6