If I try to extract the length of all exons (also those overlapping) using the GenomicFeatures
R package with this code and this gencode file
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("tables/gencode.v43.basic.annotation.gtf.gz",
format = "gtf")
exons.list.per.gene <- exonsBy(txdb, by = "gene")
sort(width(exons.list.per.gene)[["ENSG00000000003.15"]])
The result is
[1] 75 84 99 108 135 189 189 199 312 2878 2879
(11 exons).
If I do it manually (with cleaning the file first using gzcat gencode.v43.basic.annotation.gff3.gz | grep -v # > gencode.v43.basic.annotation.clean.gff
)
library(data.table)
gencode_file_gff3 <- fread("tables/gencode.v43.basic.annotation.clean.gff",
header = FALSE)
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"] # filtering for exons
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)] # extractin ENSG number
gencode_file_gff3 <- gencode_file_gff3[, c("V4", "V5", "gene_id")] # reducing to start, end and gene_id
gencode_file_gff3[, length := V5 - (V4 - 1)] # calculating length
gencode_file_gff3 <- unique(gencode_file_gff3) # removing identical exons from HAVANA and ENSEMBL
sort(gencode_file_gff3[gene_id == "ENSG00000000003.15"]$length)
The result is
[1] 75 84 99 108 135 189 199 312 2878 2879
(10 exons). Without the unique command I retain 15 exons. So both is different from the Package. I can't figure out where this additional exon comes from. The unique command is necessary if two exons from HAVANA and ENSEMBL are exactly identical. I am not able to reproduce these results manually.
This is just exemplary, but I get slightly different numbers for most genes.
In case anyone is interested, I also asked on Github an their answer was: https://github.com/Bioconductor/GenomicFeatures/issues/60 Short: GenomicFeatures uses Ensembl exon ids for filtering (ENSE number). Some Exons are identical based on their genomic location, but not on their ENSE number. So if we take all exons using GenomicFeatures, we get more then with my code. If we would only take the non overlapping parts (i.e. the unique combined genomic regions), this effect should be gone.