Tutorial:Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor
4
69
Entering edit mode
11.3 years ago
Irsan ★ 7.8k

Hi all,

It took me a while to figure this out so I thought it might be useful to a few other people.

When you have used htseq-count on each of your RNA-seq'ed samples and have combined all of your samples in a read count matrix, you might want to turn the absolute read counts into FPKM at some time. Therefore you need the total amount (maybe TMM/quartile-normalized) of mapped reads for each sample (easy) and the total exonic length for each gene (tricky).

The problem is that a gene can have multiple transcripts where the exons of transcript 1 might be partially overlapping with the exons of transcript2. So simply adding up the length of all exons annotated for a specific gene overestimates the gene size while taking the length of just 1 of the transcripts underestimates. Using the GenomicFeatures package from Bioconductor you can easily handle this issue.

# First, import the GTF-file that you have also used as input for htseq-count
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("yourFile.gtf",format="gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
fpkm rna-seq bioconductor • 36k views
ADD COMMENT
3
Entering edit mode

Thanks a lot, save my time writing a perl script. :)

ADD REPLY
0
Entering edit mode

Nice to hear that it's useful to others

ADD REPLY
1
Entering edit mode

Thank you for this script !

ADD REPLY
1
Entering edit mode

Really Really tanks!!

ADD REPLY
0
Entering edit mode

you're welcome. maybe you should make your post a comment instead of an answer

ADD REPLY
1
Entering edit mode

Thanks so much, this saves me a lot of time!

ADD REPLY
1
Entering edit mode

Hi Irsan,

Thanks for a very helpful chunk of code! It solved two of my problems. First I got the length for calculating FPKM which I need later in my analysis, secondly it gave an idea how to quickly count the number of exons per gene which I also needed.

Just a quick comment. Instead of using lapply I simply run:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Which took only about 2 seconds and returned a data.frame which I could easly merge with the rest of my data.

ADD REPLY
0
Entering edit mode

updating comment error resolved after starting R new session.

Hi all,

I'm getting an error in this line:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'width': argument ".f" is missing, with no default

Did anyone encounter this ?

Best

ADD REPLY
0
Entering edit mode

Hi,

Did you by any chance resolve this? I'm very interested in getting this code to work as well.

Best, Asta

ADD REPLY
3
Entering edit mode

I think this error comes from R running "reduce" from dplyr (if you have it loaded) instead of from GenomicRanges as we want it to do. You can substitute "reduce" with "GenomicRanges::reduce" to fix it. So the command will be:

exonic.gene.sizes <- as.data.frame(sum(width(GenomicRanges::reduce(exons.list.per.gene))))
ADD REPLY
1
Entering edit mode

Using the code above For zebrafish, what you can do is:

#first get the gtf file from, lets say, ensembl
library(GenomicFeatures)
#the function above makeTranscriptDbFromGFF is now have different name, make sure that you type the correct one
txdb <- makeTxDbFromGFF("Danio_rerio.GRCz11.93.gtf",format="gtf")
head(txdb)
View(txdb)
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
head(exons.list.per.gene)
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})
head(exonic.gene.sizes)
#to see them in a table format, you should unlist them
unlist_geneLength<-unlist(exonic.gene.sizes)
write.table(unlist_geneLength,"geneLength2.txt")
ADD REPLY
0
Entering edit mode

The code above was working two days ago, but did not work yesterday. Here is the thing, I guess, one of the packages I have downloaded before GenomicFeatures prevented the code above to work properly (although there were no warning, package masking). Where do I know this from? Simple, today the code is working :). Please, be careful about this.

ADD REPLY
1
Entering edit mode

Hi! your code looks great! I want to calculate the and I got to this post via github page but I am confused. Is this code useful to calculate the length only or the meanFragmentLength?

Thank you!

ADD REPLY
0
Entering edit mode

Looks like we're on the same path

ADD REPLY
0
Entering edit mode

yeah but I am sure you know how to do it, whereas I am just go more confused :/

ADD REPLY
1
Entering edit mode

Nope, I kinda gave up. I got to the image on what fragment size is (What is the difference between a Read and a Fragment in RNA-seq?) but cannot figure out how to connect that to the picard metrics output. For the moment, I'd say use insert size for fragment length especially since you trimmed adapters using trimmomatic anyway.

ADD REPLY
0
Entering edit mode

Can different genes that share the same exons cause similar problem?

ADD REPLY
0
Entering edit mode

Because you reduce overlapping exons for each individual gene seperately, they will contribute to both gene sizes like the way you want

ADD REPLY
0
Entering edit mode

How about total non-overlapping introns per gene lengths? Introns are grouped by transcripts.How could I group them by genes? Thanks!

ADD REPLY
0
Entering edit mode

Oef, I have to look that up in the GenomicRanges manual. Their is an introns() accessor, that might bring you somewhere

ADD REPLY
0
Entering edit mode

Hi Irsan, How can you collect exons per gene_name, instead of gene id? I am using a gtf file Thank you

ADD REPLY
0
Entering edit mode

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLY
0
Entering edit mode

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLY
0
Entering edit mode

Yes, i am using annotation gtf file from Ensembl. I want to get gene symbol, instead of Ensembl gene ID. Do you know how? Thanks

ADD REPLY
0
Entering edit mode

Many ensembl gene ids do not have an official HGNC symbol (that is what you are looking for). I think the gene symbols are not even in the gtf-file. So you have to work with Ensembl gene IDs and at the end annotate them with gene symbol with for example bioconductor package org.Hs.eg.db

ADD REPLY
0
Entering edit mode

I used gtf file for human from here but the makeTranscriptDbFromGFF gave me the following error,

error in evaluating the argument 'x' in selecting a method for function 'unique': Error in subs[[i]] : subscript out of bounds

is there any suggestions? probably a point to another gtf file?

ADD REPLY
1
Entering edit mode

Download GTF-file from UCSC table browser

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
Nice it works for you right now. I still wonder what the problem is with the ensembl 75 gtf file. Have you tried removing the first few comment lines (starting with #) in the gtf?
ADD REPLY
0
Entering edit mode

I've tried this and it might not be the best idea. When creating a GTF-file, the UCSC table browser fills in the gene_ID field with the transcript_ID, no matter whether the UCSC, RefSeq or Ensembl track is used, which defeats the purpose of your script. I have no idea why UCSC would do this, seems like a very dangerous oversight on their part.

Thanks for the script though, it's saved me a lot of time.

ADD REPLY
0
Entering edit mode

The last part:

exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})

Seems to be taking extraordinarily long to run. Any faster method?

ADD REPLY
1
Entering edit mode

sum(width(reduce(exons.list.per.gene))) works directly and should be faster

ADD REPLY
0
Entering edit mode
it takes 30 minutes for ensGene table with one 2.67 Ghz cpu
ADD REPLY
0
Entering edit mode

Thank you so much for sharing the code! I am a beginner, just wondering if you could help with how to input the length information into the dds data. Many thanks!

ADD REPLY
0
Entering edit mode

Hi Irsan

Thank you very much for providing us with the results of your studies.

The study I am analyzing is a single-end sequence.

Can I use your code with these conditions?

ADD REPLY
0
Entering edit mode

Hi,

To reduce the time taken to upload the gtf file. Install using bioconductor the TxDb libraries:

# Gene lenght in bp
## (TxDb.Mmusculus.UCSC.mm10.knownGene)  Mus musculus
## (TxDb.Hsapiens.UCSC.hg19.knownGene) Homo sapiens
## (TxDb.Hsapiens.UCSC.hg38.knownGene) Homo sapiens
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
# add to count data the gene length. Only works with ENTREZID
data$gene_length<- exonic.gene.sizes[data$entrezid]

best

ADD REPLY
0
Entering edit mode

Would you recommend using the gencode comprehensive annotations file or the basic annotations file to do this since they produce different results (see here and here)? Here they use the comprehensive one.

ADD REPLY
6
Entering edit mode
9.8 years ago
Kamil ★ 2.3k

Here's another way to do it with Python instead of R:

#!/usr/bin/env python
"""
coding_lengths.py
Kamil Slowikowski
February 7, 2014
Count the number of coding base pairs in each Gencode gene.
Gencode coordinates, including all exons with Ensembl identifiers.
(Gencode release 17 corresponds to hg19)
ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4";
NCBI mapping from Entrez GeneID to Ensembl identifiers.
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz
9606 1 ENSG00000121410 NM_130786.3 ENST00000263100 NP_570602.2 ENSP00000263100
9606 2 ENSG00000175899 NM_000014.4 ENST00000318602 NP_000005.2 ENSP00000323929
9606 3 ENSG00000256069 NR_040112.1 ENST00000543404 - -
9606 9 ENSG00000171428 NM_000662.5 ENST00000307719 NP_000653.3 ENSP00000307218
9606 9 ENSG00000171428 XM_005273679.1 ENST00000517492 XP_005273736.1 ENSP00000429407
Output:
Ensembl_gene_identifier GeneID length
ENSG00000000005 64102 1339
ENSG00000000419 8813 1185
ENSG00000000457 57147 3755
ENSG00000000938 2268 3167
USAGE:
coding_lengths.py -g FILE -n FILE [-o FILE]
OPTIONS:
-h Show this help message.
-g FILE Gencode annotation.gtf.gz file.
-n FILE NCBI gene2ensembl.gz file.
-o FILE Output file (gzipped).
"""
import GTF # https://gist.github.com/slowkow/8101481
from docopt import docopt
import pandas as pd
import gzip
import time
import sys
from contextlib import contextmanager
def main(args):
# Input files.
GENCODE = args['-g']
NCBI_ENSEMBL = args['-n']
# Output file prefix.
GENE_LENGTHS = args['-o'] or "ncbi_ensembl_coding_lengths.txt.gz"
with log("Reading the Gencode annotation file: {}".format(GENCODE)):
gc = GTF.dataframe(GENCODE)
# Select just exons of protein coding genes, and columns that we want to use.
idx = (gc.feature == 'exon') & (gc.transcript_type == 'protein_coding')
exon = gc.ix[idx, ['seqname','start','end','gene_id','gene_name']]
# Convert columns to proper types.
exon.start = exon.start.astype(int)
exon.end = exon.end.astype(int)
# Sort in place.
exon.sort(['seqname','start','end'], inplace=True)
# Group the rows by the Ensembl gene identifier (with version numbers.)
groups = exon.groupby('gene_id')
with log("Calculating coding region (exonic) length for each gene..."):
lengths = groups.apply(count_bp)
with log("Reading NCBI mapping of Entrez GeneID "\
"to Ensembl gene identifier: {}".format(NCBI_ENSEMBL)):
g2e = pd.read_table(NCBI_ENSEMBL,
compression="gzip",
header=None,
names=['tax_id', 'GeneID',
'Ensembl_gene_identifier',
'RNA_nucleotide_accession.version',
'Ensembl_rna_identifier',
'protein_accession.version',
'Ensembl_protein_identifier'])
# Create a new DataFrame with gene lengths and EnsemblID.
ensembl_no_version = lengths.index.map(lambda x: x.split(".")[0])
ldf = pd.DataFrame({'length': lengths,
'Ensembl_gene_identifier': ensembl_no_version},
index=lengths.index)
# Merge so we have EntrezGeneID with length.
m1 = pd.merge(ldf, g2e, on='Ensembl_gene_identifier')
m1 = m1[['Ensembl_gene_identifier', 'GeneID', 'length']].drop_duplicates()
with log("Writing output file: {}".format(GENE_LENGTHS)):
with gzip.open(GENE_LENGTHS, "wb") as out:
m1.to_csv(out, sep="\t", index=False)
def count_bp(df):
"""Given a DataFrame with the exon coordinates from Gencode for a single
gene, return the total number of coding bases in that gene.
Example:
>>> import numpy as np
>>> n = 3
>>> r = lambda x: np.random.sample(x) * 10
>>> d = pd.DataFrame([np.sort([a,b]) for a,b in zip(r(n), r(n))], columns=['start','end']).astype(int)
>>> d
start end
0 6 9
1 3 4
2 4 9
>>> count_bp(d)
7
Here is a visual representation of the 3 exons and the way they are added:
123456789 Length
0 ---- 4
1 -- 2
2 ------ 6
======= 7
"""
start = df.start.min()
end = df.end.max()
bp = [False] * (end - start + 1)
for i in range(df.shape[0]):
s = df.iloc[i]['start'] - start
e = df.iloc[i]['end'] - start + 1
bp[s:e] = [True] * (e - s)
return sum(bp)
@contextmanager
def log(message):
"""Log a timestamp, a message, and the elapsed time to stderr."""
start = time.time()
sys.stderr.write("{} # {}\n".format(time.asctime(), message))
yield
elapsed = int(time.time() - start + 0.5)
sys.stderr.write("{} # done in {} s\n".format(time.asctime(), elapsed))
sys.stderr.flush()
if __name__ == '__main__':
args = docopt(__doc__)
main(args)
#!/usr/bin/env Rscript
# plot_coding_lengths.R
# Kamil Slowikowski
# April 17, 2014
d = read.delim("ncbi_ensembl_coding_lengths.txt.gz")
png("coding_lengths.png")
plot(density(log10(d$length)),
main="Gencode v19 lengths of coding regions by gene")
dev.off()

ADD COMMENT
0
Entering edit mode

Thanks for this. Can I just triple-check that the code "coding_lengths.py" is returning the total length of all of the exons (I am trying to calculate RPKM, so I think this is perfect to calculate gene lengths for that purpose).

And then just as a side note, is this data available for other species? I'm on the FTP site for "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz" now, and I can seeing anything resembling a folder for other species? Similar for the Sanger FTP site, I can see human and mouse; my work involves human mouse and rat, I'm wondering if this script would work on all of these species, and if not, if someone could suggest an alternative method?

ADD REPLY
1
Entering edit mode

I updated the comment in count_bp with a visual example of how exons are counted.

You can find gene annotations in GTF format for many species here:

ftp://ftp.ensembl.org/pub/release-87/gtf/

Note that the version (87 at the time of this post) is continually updated over time.

Regarding converting Ensembl ID to Entrez ID or vice versa, I recommend you use Ensembl Biomart.

ADD REPLY
0
Entering edit mode

Thanks. I actually tried to run the script using exactly as above and I ran into an error. In the attached picture: picture, in the purple section I have shown the first few files of the .gtf file, and in the orange section the gene2ensembl file, just to show that I've just downloaded these exactly as you said. Then the green section I have run the code.

The code is also here:

python coding_lengths.py -g gencode.v19.annotation.gtf -n gene2ensembl -o output

Wed Jan 18 11:30:46 2017 # Reading the Gencode annotation file: gencode.v19.annotation.gtf
Wed Jan 18 11:34:36 2017 # done in 230 s
coding_lengths.py:81: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  exon.sort(['seqname','start','end'], inplace=True)
Wed Jan 18 11:34:39 2017 # Calculating coding region (exonic) length for each gene...
Wed Jan 18 11:38:18 2017 # done in 219 s
Wed Jan 18 11:38:18 2017 # Reading NCBI mapping of Entrez GeneID to Ensembl gene identifier: gene2ensembl
Traceback (most recent call last):
  File "coding_lengths.py", line 163, in <module>
    main(args)
  File "coding_lengths.py", line 99, in main
    'Ensembl_protein_identifier'])
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4948)
  File "pandas/parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas/parser.c:7386)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

I know it is not your job to solve my system errors. My only question is; to a more experienced coder, does this look like a problem with the script itself, or does it look like a problem on my end (i.e. with the way packages are installed or something)?

Thanks

ADD REPLY
1
Entering edit mode

My script expects to read a file compressed with gzip, but you provided a decompressed file.

Try running gzip gene2ensembl and then run the script again.

ADD REPLY
0
Entering edit mode

Thank you, you were so helpful, the script is absolutely fantastic.

ADD REPLY
0
Entering edit mode

Just a small note to everyone, the Ensembl GTF files are not the same as the NCBI Gencode GTF files. For this script, the only difference you seem to need to make is in the coding_lengths.py script, change line 73 to:

idx = (gc.feature == 'exon') & (gc.transcript_biotype == 'protein_coding')

(so you're changing gc.transcript_type to gc.transcript.biotype).

ADD REPLY
2
Entering edit mode
7.0 years ago
lhdcsu ▴ 60

Recommend the GTFtools package at: http://www.genemine.org/gtftools.php.

Using the '-l' options, four types of gene lengths will be calculated. One of them is the non-overlapping exon length (merge all exons of multiple isoforms of the same gene).

ADD COMMENT
0
Entering edit mode
9.8 years ago
yc1123 • 0

Thanks for the scripts. But when I run the 2nd line "exons.list.per.gene", I get "error, RS-DBI driver: (expired SQLiteConnection)". Dose anyone know why?

ADD COMMENT
0
Entering edit mode
14 months ago
alejandrogzi ▴ 140

Hi! in case anyone ends here looking to calculate non-overlapping exon lengths, I recently develop this: https://github.com/alejandrogzi/noel, an extremely fast non-overlapping exon length calculator written in Rust.

This program outperforms all the script/tools already mentioned in this discussion, check the tool post here: NOEL: An extremely fast Non-Overlapping Exon Length calculator written in Rust.

ADD COMMENT

Login before adding your answer.

Traffic: 3620 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6