Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor

Tutorial:Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor

Entering edit mode

11.5 years ago

Irsan ★ 7.8k

Hi all,

It took me a while to figure this out so I thought it might be useful to a few other people.

When you have used htseq-count on each of your RNA-seq'ed samples and have combined all of your samples in a read count matrix, you might want to turn the absolute read counts into FPKM at some time. Therefore you need the total amount (maybe TMM/quartile-normalized) of mapped reads for each sample (easy) and the total exonic length for each gene (tricky).

The problem is that a gene can have multiple transcripts where the exons of transcript 1 might be partially overlapping with the exons of transcript2. So simply adding up the length of all exons annotated for a specific gene overestimates the gene size while taking the length of just 1 of the transcripts underestimates. Using the GenomicFeatures package from Bioconductor you can easily handle this issue.

# First, import the GTF-file that you have also used as input for htseq-count
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("yourFile.gtf",format="gtf")
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))

fpkm rna-seq bioconductor • 38k views

ADD COMMENT • link updated 17 months ago by Ram 45k • written 11.5 years ago by Irsan ★ 7.8k

Entering edit mode

Thanks a lot, save my time writing a perl script. :)

ADD REPLY • link 11.4 years ago by yyaobo ▴ 30

Entering edit mode

Nice to hear that it's useful to others

ADD REPLY • link 11.4 years ago by Irsan ★ 7.8k

Entering edit mode

Thank you for this script !

ADD REPLY • link 11.3 years ago by variantcalling.biosolution ▴ 10

Entering edit mode

Really Really tanks!!

ADD REPLY • link 11.0 years ago by fusion.slope ▴ 250

Entering edit mode

you're welcome. maybe you should make your post a comment instead of an answer

ADD REPLY • link 11.0 years ago by Irsan ★ 7.8k

Entering edit mode

Thanks so much, this saves me a lot of time!

ADD REPLY • link 11.0 years ago by mararie ▴ 10

Entering edit mode

Hi Irsan,

Thanks for a very helpful chunk of code! It solved two of my problems. First I got the length for calculating FPKM which I need later in my analysis, secondly it gave an idea how to quickly count the number of exons per gene which I also needed.

Just a quick comment. Instead of using lapply I simply run:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Which took only about 2 seconds and returned a data.frame which I could easly merge with the rest of my data.

ADD REPLY • link 7.5 years ago by ggusiak ▴ 30

Entering edit mode

updating comment error resolved after starting R new session.

Hi all,

I'm getting an error in this line:

exonic.gene.sizes <- as.data.frame(sum(width(reduce(exons.list.per.gene))))

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': error in evaluating the argument 'x' in selecting a method for function 'width': argument ".f" is missing, with no default

Did anyone encounter this ?

Best

ADD REPLY • link 4.5 years ago by MatStat ▴ 160

Entering edit mode

Hi,

Did you by any chance resolve this? I'm very interested in getting this code to work as well.

Best, Asta

ADD REPLY • link 4.2 years ago by asta.tranholm • 0

Entering edit mode

I think this error comes from R running "reduce" from dplyr (if you have it loaded) instead of from GenomicRanges as we want it to do. You can substitute "reduce" with "GenomicRanges::reduce" to fix it. So the command will be:

exonic.gene.sizes <- as.data.frame(sum(width(GenomicRanges::reduce(exons.list.per.gene))))

ADD REPLY • link 2.4 years ago by Acadia ▴ 30

Entering edit mode

Using the code above For zebrafish, what you can do is:

#first get the gtf file from, lets say, ensembl
library(GenomicFeatures)
#the function above makeTranscriptDbFromGFF is now have different name, make sure that you type the correct one
txdb <- makeTxDbFromGFF("Danio_rerio.GRCz11.93.gtf",format="gtf")
head(txdb)
View(txdb)
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(txdb,by="gene")
head(exons.list.per.gene)
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})
head(exonic.gene.sizes)
#to see them in a table format, you should unlist them
unlist_geneLength<-unlist(exonic.gene.sizes)
write.table(unlist_geneLength,"geneLength2.txt")

ADD REPLY • link 6.4 years ago by realnewbie ▴ 30

Entering edit mode

The code above was working two days ago, but did not work yesterday. Here is the thing, I guess, one of the packages I have downloaded before GenomicFeatures prevented the code above to work properly (although there were no warning, package masking). Where do I know this from? Simple, today the code is working :). Please, be careful about this.

ADD REPLY • link 6.4 years ago by realnewbie ▴ 30

Entering edit mode

Hi! your code looks great! I want to calculate the and I got to this post via github page but I am confused. Is this code useful to calculate the length only or the meanFragmentLength?

Thank you!

ADD REPLY • link 19 months ago by camillab. ▴ 160

Entering edit mode

Looks like we're on the same path

ADD REPLY • link 19 months ago by Ram 45k

Entering edit mode

yeah but I am sure you know how to do it, whereas I am just go more confused :/

ADD REPLY • link 19 months ago by camillab. ▴ 160

Entering edit mode

Nope, I kinda gave up. I got to the image on what fragment size is (What is the difference between a Read and a Fragment in RNA-seq?) but cannot figure out how to connect that to the picard metrics output. For the moment, I'd say use insert size for fragment length especially since you trimmed adapters using trimmomatic anyway.

ADD REPLY • link 19 months ago by Ram 45k

Entering edit mode

Can different genes that share the same exons cause similar problem?

ADD REPLY • link 11.5 years ago by PoGibas 5.1k

Entering edit mode

Because you reduce overlapping exons for each individual gene seperately, they will contribute to both gene sizes like the way you want

ADD REPLY • link 11.5 years ago by Irsan ★ 7.8k

Entering edit mode

How about total non-overlapping introns per gene lengths? Introns are grouped by transcripts.How could I group them by genes? Thanks!

ADD REPLY • link 11.3 years ago by liux.bio ▴ 370

Entering edit mode

Oef, I have to look that up in the GenomicRanges manual. Their is an introns() accessor, that might bring you somewhere

ADD REPLY • link 11.3 years ago by Irsan ★ 7.8k

Entering edit mode

Hi Irsan, How can you collect exons per gene_name, instead of gene id? I am using a gtf file Thank you

ADD REPLY • link 11.2 years ago by hoangtv ▴ 30

Entering edit mode

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLY • link 11.2 years ago by Irsan ★ 7.8k

Entering edit mode

Do you mean gene symbol in stead of gene name? Like TP53/RB1/E2F3?

ADD REPLY • link 11.2 years ago by Irsan ★ 7.8k

Entering edit mode

Yes, i am using annotation gtf file from Ensembl. I want to get gene symbol, instead of Ensembl gene ID. Do you know how? Thanks

ADD REPLY • link 11.2 years ago by hoangtv ▴ 30

Entering edit mode

Many ensembl gene ids do not have an official HGNC symbol (that is what you are looking for). I think the gene symbols are not even in the gtf-file. So you have to work with Ensembl gene IDs and at the end annotate them with gene symbol with for example bioconductor package org.Hs.eg.db

ADD REPLY • link 11.2 years ago by Irsan ★ 7.8k

Entering edit mode

I used gtf file for human from here but the makeTranscriptDbFromGFF gave me the following error,

error in evaluating the argument 'x' in selecting a method for function 'unique': Error in subs[[i]] : subscript out of bounds

is there any suggestions? probably a point to another gtf file?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Quak ▴ 520

Entering edit mode

Download GTF-file from UCSC table browser

ADD REPLY • link 10.9 years ago by Irsan ★ 7.8k

Entering edit mode

didn't work - I decided using http://bioconductor.org/packages/2.10/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Quak ▴ 520

Entering edit mode

Nice it works for you right now. I still wonder what the problem is with the ensembl 75 gtf file. Have you tried removing the first few comment lines (starting with #) in the gtf?

ADD REPLY • link 10.9 years ago by Irsan ★ 7.8k

Entering edit mode

I've tried this and it might not be the best idea. When creating a GTF-file, the UCSC table browser fills in the gene_ID field with the transcript_ID, no matter whether the UCSC, RefSeq or Ensembl track is used, which defeats the purpose of your script. I have no idea why UCSC would do this, seems like a very dangerous oversight on their part.

Thanks for the script though, it's saved me a lot of time.

ADD REPLY • link 9.0 years ago by Matthieu Miossec ▴ 370

Entering edit mode

The last part:

exonic.gene.sizes <- lapply(exons.list.per.gene,function(x){sum(width(reduce(x)))})

Seems to be taking extraordinarily long to run. Any faster method?

ADD REPLY • link 10.8 years ago by essell22 ▴ 10

Entering edit mode

sum(width(reduce(exons.list.per.gene))) works directly and should be faster

ADD REPLY • link 9.5 years ago by pgp.martin ▴ 30

Entering edit mode

it takes 30 minutes for ensGene table with one 2.67 Ghz cpu

ADD REPLY • link 10.8 years ago by Irsan ★ 7.8k

Entering edit mode

Thank you so much for sharing the code! I am a beginner, just wondering if you could help with how to input the length information into the dds data. Many thanks!

ADD REPLY • link 6.9 years ago by shenwei1376 • 0

Entering edit mode

Hi Irsan

Thank you very much for providing us with the results of your studies.

The study I am analyzing is a single-end sequence.

Can I use your code with these conditions?

ADD REPLY • link 3.4 years ago by Maryam • 0

Entering edit mode

Hi,

To reduce the time taken to upload the gtf file. Install using bioconductor the TxDb libraries:

# Gene lenght in bp
## (TxDb.Mmusculus.UCSC.mm10.knownGene)  Mus musculus
## (TxDb.Hsapiens.UCSC.hg19.knownGene) Homo sapiens
## (TxDb.Hsapiens.UCSC.hg38.knownGene) Homo sapiens
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# then collect the exons per gene id
exons.list.per.gene <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene")
# then for each gene, reduce all the exons to a set of non overlapping exons, calculate their lengths (widths) and sum then
exonic.gene.sizes <- sum(width(reduce(exons.list.per.gene)))
# add to count data the gene length. Only works with ENTREZID
data$gene_length<- exonic.gene.sizes[data$entrezid]

best

ADD REPLY • link updated 23 months ago by Ram 45k • written 24 months ago by YORDI SEBASTIÁN • 0

Entering edit mode

Would you recommend using the gencode comprehensive annotations file or the basic annotations file to do this since they produce different results (see here and here)? Here they use the comprehensive one.

ADD REPLY • link 22 months ago by gernophil ▴ 120

Entering edit mode

10.0 years ago

Kamil ★ 2.3k

Here's another way to do it with Python instead of R:

view raw coding_lengths.png hosted with ❤ by GitHub

	#!/usr/bin/env python
	"""
	coding_lengths.py
	Kamil Slowikowski
	February 7, 2014

	Count the number of coding base pairs in each Gencode gene.

	Gencode coordinates, including all exons with Ensembl identifiers.
	(Gencode release 17 corresponds to hg19)

	ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
	ftp://ftp.sanger.ac.uk/pub/gencode/release_17/gencode.v17.annotation.gtf.gz

	chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4";
	chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4";
	chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4";
	chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.4";
	chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.4";


	NCBI mapping from Entrez GeneID to Ensembl identifiers.

	ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz

	9606 1 ENSG00000121410 NM_130786.3 ENST00000263100 NP_570602.2 ENSP00000263100
	9606 2 ENSG00000175899 NM_000014.4 ENST00000318602 NP_000005.2 ENSP00000323929
	9606 3 ENSG00000256069 NR_040112.1 ENST00000543404 - -
	9606 9 ENSG00000171428 NM_000662.5 ENST00000307719 NP_000653.3 ENSP00000307218
	9606 9 ENSG00000171428 XM_005273679.1 ENST00000517492 XP_005273736.1 ENSP00000429407

	Output:

	Ensembl_gene_identifier GeneID length
	ENSG00000000005 64102 1339
	ENSG00000000419 8813 1185
	ENSG00000000457 57147 3755
	ENSG00000000938 2268 3167

	USAGE:
	coding_lengths.py -g FILE -n FILE [-o FILE]

	OPTIONS:
	-h Show this help message.
	-g FILE Gencode annotation.gtf.gz file.
	-n FILE NCBI gene2ensembl.gz file.
	-o FILE Output file (gzipped).
	"""


	import GTF # https://gist.github.com/slowkow/8101481

	from docopt import docopt
	import pandas as pd
	import gzip
	import time
	import sys
	from contextlib import contextmanager


	def main(args):
	# Input files.
	GENCODE = args['-g']
	NCBI_ENSEMBL = args['-n']

	# Output file prefix.
	GENE_LENGTHS = args['-o'] or "ncbi_ensembl_coding_lengths.txt.gz"

	with log("Reading the Gencode annotation file: {}".format(GENCODE)):
	gc = GTF.dataframe(GENCODE)

	# Select just exons of protein coding genes, and columns that we want to use.
	idx = (gc.feature == 'exon') & (gc.transcript_type == 'protein_coding')
	exon = gc.ix[idx, ['seqname','start','end','gene_id','gene_name']]

	# Convert columns to proper types.
	exon.start = exon.start.astype(int)
	exon.end = exon.end.astype(int)

	# Sort in place.
	exon.sort(['seqname','start','end'], inplace=True)

	# Group the rows by the Ensembl gene identifier (with version numbers.)
	groups = exon.groupby('gene_id')

	with log("Calculating coding region (exonic) length for each gene..."):
	lengths = groups.apply(count_bp)

	with log("Reading NCBI mapping of Entrez GeneID "\
	"to Ensembl gene identifier: {}".format(NCBI_ENSEMBL)):
	g2e = pd.read_table(NCBI_ENSEMBL,
	compression="gzip",
	header=None,
	names=['tax_id', 'GeneID',
	'Ensembl_gene_identifier',
	'RNA_nucleotide_accession.version',
	'Ensembl_rna_identifier',
	'protein_accession.version',
	'Ensembl_protein_identifier'])

	# Create a new DataFrame with gene lengths and EnsemblID.
	ensembl_no_version = lengths.index.map(lambda x: x.split(".")[0])
	ldf = pd.DataFrame({'length': lengths,
	'Ensembl_gene_identifier': ensembl_no_version},
	index=lengths.index)

	# Merge so we have EntrezGeneID with length.
	m1 = pd.merge(ldf, g2e, on='Ensembl_gene_identifier')
	m1 = m1[['Ensembl_gene_identifier', 'GeneID', 'length']].drop_duplicates()

	with log("Writing output file: {}".format(GENE_LENGTHS)):
	with gzip.open(GENE_LENGTHS, "wb") as out:
	m1.to_csv(out, sep="\t", index=False)


	def count_bp(df):
	"""Given a DataFrame with the exon coordinates from Gencode for a single
	gene, return the total number of coding bases in that gene.
	Example:
	>>> import numpy as np
	>>> n = 3
	>>> r = lambda x: np.random.sample(x) * 10
	>>> d = pd.DataFrame([np.sort([a,b]) for a,b in zip(r(n), r(n))], columns=['start','end']).astype(int)
	>>> d
	start end
	0 6 9
	1 3 4
	2 4 9
	>>> count_bp(d)
	7

	Here is a visual representation of the 3 exons and the way they are added:

	123456789 Length
	0 ---- 4
	1 -- 2
	2 ------ 6
	======= 7
	"""
	start = df.start.min()
	end = df.end.max()
	bp = [False] * (end - start + 1)
	for i in range(df.shape[0]):
	s = df.iloc[i]['start'] - start
	e = df.iloc[i]['end'] - start + 1
	bp[s:e] = [True] * (e - s)
	return sum(bp)


	@contextmanager
	def log(message):
	"""Log a timestamp, a message, and the elapsed time to stderr."""
	start = time.time()
	sys.stderr.write("{} # {}\n".format(time.asctime(), message))
	yield
	elapsed = int(time.time() - start + 0.5)
	sys.stderr.write("{} # done in {} s\n".format(time.asctime(), elapsed))
	sys.stderr.flush()


	if __name__ == '__main__':
	args = docopt(__doc__)
	main(args)

view raw coding_lengths.py hosted with ❤ by GitHub

	#!/usr/bin/env Rscript
	# plot_coding_lengths.R
	# Kamil Slowikowski
	# April 17, 2014
	d = read.delim("ncbi_ensembl_coding_lengths.txt.gz")
	png("coding_lengths.png")
	plot(density(log10(d$length)),
	main="Gencode v19 lengths of coding regions by gene")
	dev.off()

view raw plot_coding_lengths.R hosted with ❤ by GitHub

ADD COMMENT • link 10.0 years ago by Kamil ★ 2.3k

Entering edit mode

Thanks for this. Can I just triple-check that the code "coding_lengths.py" is returning the total length of all of the exons (I am trying to calculate RPKM, so I think this is perfect to calculate gene lengths for that purpose).

And then just as a side note, is this data available for other species? I'm on the FTP site for "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz" now, and I can seeing anything resembling a folder for other species? Similar for the Sanger FTP site, I can see human and mouse; my work involves human mouse and rat, I'm wondering if this script would work on all of these species, and if not, if someone could suggest an alternative method?

ADD REPLY • link 8.3 years ago by Tom ▴ 50

Entering edit mode

I updated the comment in count_bp with a visual example of how exons are counted.

You can find gene annotations in GTF format for many species here:

ftp://ftp.ensembl.org/pub/release-87/gtf/

Note that the version (87 at the time of this post) is continually updated over time.

Regarding converting Ensembl ID to Entrez ID or vice versa, I recommend you use Ensembl Biomart.

ADD REPLY • link 8.3 years ago by Kamil ★ 2.3k

Entering edit mode

Thanks. I actually tried to run the script using exactly as above and I ran into an error. In the attached picture: , in the purple section I have shown the first few files of the .gtf file, and in the orange section the gene2ensembl file, just to show that I've just downloaded these exactly as you said. Then the green section I have run the code.

The code is also here:

python coding_lengths.py -g gencode.v19.annotation.gtf -n gene2ensembl -o output

Wed Jan 18 11:30:46 2017 # Reading the Gencode annotation file: gencode.v19.annotation.gtf
Wed Jan 18 11:34:36 2017 # done in 230 s
coding_lengths.py:81: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  exon.sort(['seqname','start','end'], inplace=True)
Wed Jan 18 11:34:39 2017 # Calculating coding region (exonic) length for each gene...
Wed Jan 18 11:38:18 2017 # done in 219 s
Wed Jan 18 11:38:18 2017 # Reading NCBI mapping of Entrez GeneID to Ensembl gene identifier: gene2ensembl
Traceback (most recent call last):
  File "coding_lengths.py", line 163, in <module>
    main(args)
  File "coding_lengths.py", line 99, in main
    'Ensembl_protein_identifier'])
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 275, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 590, in __init__
    self._make_engine(self.engine)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 731, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/nis/aoife/env/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1103, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 515, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4948)
  File "pandas/parser.pyx", line 705, in pandas.parser.TextReader._get_header (pandas/parser.c:7386)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

I know it is not your job to solve my system errors. My only question is; to a more experienced coder, does this look like a problem with the script itself, or does it look like a problem on my end (i.e. with the way packages are installed or something)?

Thanks

ADD REPLY • link 8.3 years ago by Tom ▴ 50

Entering edit mode

My script expects to read a file compressed with gzip, but you provided a decompressed file.

Try running gzip gene2ensembl and then run the script again.

ADD REPLY • link 8.3 years ago by Kamil ★ 2.3k

Entering edit mode

Thank you, you were so helpful, the script is absolutely fantastic.

ADD REPLY • link 8.3 years ago by Tom ▴ 50

Entering edit mode

Just a small note to everyone, the Ensembl GTF files are not the same as the NCBI Gencode GTF files. For this script, the only difference you seem to need to make is in the coding_lengths.py script, change line 73 to:

idx = (gc.feature == 'exon') & (gc.transcript_biotype == 'protein_coding')

(so you're changing gc.transcript_type to gc.transcript.biotype).

ADD REPLY • link 8.3 years ago by Tom ▴ 50

Entering edit mode

7.2 years ago

lhdcsu ▴ 60

Recommend the GTFtools package at: http://www.genemine.org/gtftools.php.

Using the '-l' options, four types of gene lengths will be calculated. One of them is the non-overlapping exon length (merge all exons of multiple isoforms of the same gene).

ADD COMMENT • link 7.2 years ago by lhdcsu ▴ 60

Entering edit mode

10.0 years ago

yc1123 • 0

Thanks for the scripts. But when I run the 2nd line "exons.list.per.gene", I get "error, RS-DBI driver: (expired SQLiteConnection)". Dose anyone know why?

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 10.0 years ago by yc1123 • 0

Entering edit mode

17 months ago

alejandrogzi ▴ 150

Hi! in case anyone ends here looking to calculate non-overlapping exon lengths, I recently develop this: https://github.com/alejandrogzi/noel, an extremely fast non-overlapping exon length calculator written in Rust.

This program outperforms all the script/tools already mentioned in this discussion, check the tool post here: NOEL: An extremely fast Non-Overlapping Exon Length calculator written in Rust.

ADD COMMENT • link 17 months ago by alejandrogzi ▴ 150