Hi all,
It took me a while to figure this out so I thought it might be useful to a few other people.
When you have used htseq-count on each of your RNA-seq'ed samples and have combined all of your samples in a read count matrix, you might want to turn the absolute read counts into FPKM at some time. Therefore you need the total amount (maybe TMM/quartile-normalized) of mapped reads for each sample (easy) and the total exonic length for each gene (tricky).
The problem is that a gene can have multiple transcripts where the exons of transcript 1 might be partially overlapping with the exons of transcript2. So simply adding up the length of all exons annotated for a specific gene overestimates the gene size while taking the length of just 1 of the transcripts underestimates. Using the GenomicFeatures package from Bioconductor you can easily handle this issue.
# First, import the GTF-file that you have also used as input for htseq-count
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("yourFile.gtf",format="gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
Thanks a lot, save my time writing a perl script. :)
Nice to hear that it's useful to others
Thank you for this script !
Really Really tanks!!
you're welcome. maybe you should make your post a comment instead of an answer
Thanks so much, this saves me a lot of time!
Hi Irsan,
Thanks for a very helpful chunk of code! It solved two of my problems. First I got the length for calculating FPKM which I need later in my analysis, secondly it gave an idea how to quickly count the number of exons per gene which I also needed.
Just a quick comment. Instead of using lapply I simply run:
Which took only about 2 seconds and returned a data.frame which I could easly merge with the rest of my data.
updating comment error resolved after starting R new session.
Hi all,
I'm getting an error in this line:
Did anyone encounter this ?
Best
Hi,
Did you by any chance resolve this? I'm very interested in getting this code to work as well.
Best, Asta
I think this error comes from R running "reduce" from dplyr (if you have it loaded) instead of from GenomicRanges as we want it to do. You can substitute "reduce" with "GenomicRanges::reduce" to fix it. So the command will be:
Using the code above For zebrafish, what you can do is:
The code above was working two days ago, but did not work yesterday. Here is the thing, I guess, one of the packages I have downloaded before GenomicFeatures prevented the code above to work properly (although there were no warning, package masking). Where do I know this from? Simple, today the code is working :). Please, be careful about this.
Hi! your code looks great! I want to calculate the and I got to this post via github page but I am confused. Is this code useful to calculate the length only or the meanFragmentLength?
Thank you!
Looks like we're on the same path
yeah but I am sure you know how to do it, whereas I am just go more confused :/
Nope, I kinda gave up. I got to the image on what fragment size is (What is the difference between a Read and a Fragment in RNA-seq?) but cannot figure out how to connect that to the picard metrics output. For the moment, I'd say use insert size for fragment length especially since you trimmed adapters using trimmomatic anyway.
Can different genes that share the same exons cause similar problem?
Because you reduce overlapping exons for each individual gene seperately, they will contribute to both gene sizes like the way you want
How about total non-overlapping introns per gene lengths? Introns are grouped by transcripts.How could I group them by genes? Thanks!
Oef, I have to look that up in the GenomicRanges manual. Their is an introns() accessor, that might bring you somewhere
Hi Irsan, How can you collect exons per gene_name, instead of gene id? I am using a gtf file Thank you
Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?
Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?
Yes, i am using annotation gtf file from Ensembl. I want to get gene symbol, instead of Ensembl gene ID. Do you know how? Thanks
Many ensembl gene ids do not have an official HGNC symbol (that is what you are looking for). I think the gene symbols are not even in the gtf-file. So you have to work with Ensembl gene IDs and at the end annotate them with gene symbol with for example bioconductor package org.Hs.eg.db
I used gtf file for human from here but the
makeTranscriptDbFromGFF
gave me the following error,is there any suggestions? probably a point to another gtf file?
Download GTF-file from UCSC table browser
didn't work - I decided using http://bioconductor.org/packages/2.10/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html
I've tried this and it might not be the best idea. When creating a GTF-file, the UCSC table browser fills in the gene_ID field with the transcript_ID, no matter whether the UCSC, RefSeq or Ensembl track is used, which defeats the purpose of your script. I have no idea why UCSC would do this, seems like a very dangerous oversight on their part.
Thanks for the script though, it's saved me a lot of time.
The last part:
Seems to be taking extraordinarily long to run. Any faster method?
sum(width(reduce(exons.list.per.gene))) works directly and should be faster
Thank you so much for sharing the code! I am a beginner, just wondering if you could help with how to input the length information into the dds data. Many thanks!
Hi Irsan
Thank you very much for providing us with the results of your studies.
The study I am analyzing is a single-end sequence.
Can I use your code with these conditions?
Hi,
To reduce the time taken to upload the gtf file. Install using bioconductor the TxDb libraries:
best
Would you recommend using the gencode comprehensive annotations file or the basic annotations file to do this since they produce different results (see here and here)? Here they use the comprehensive one.