Extracting exons using GenomicFeatures is different from manual extraction
2
1
Entering edit mode
17 months ago
gernophil ▴ 90

If I try to extract the length of all exons (also those overlapping) using the GenomicFeatures R package with this code and this gencode file

library(GenomicFeatures)
txdb <- makeTxDbFromGFF("tables/gencode.v43.basic.annotation.gtf.gz",
                        format = "gtf")
exons.list.per.gene <- exonsBy(txdb, by = "gene")
sort(width(exons.list.per.gene)[["ENSG00000000003.15"]])

The result is

[1]   75   84   99  108  135  189  189  199  312 2878 2879

(11 exons).

If I do it manually (with cleaning the file first using gzcat gencode.v43.basic.annotation.gff3.gz | grep -v # > gencode.v43.basic.annotation.clean.gff)

library(data.table)
gencode_file_gff3 <- fread("tables/gencode.v43.basic.annotation.clean.gff",
                           header = FALSE)
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"] # filtering for exons
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)] # extractin ENSG number
gencode_file_gff3 <- gencode_file_gff3[, c("V4", "V5", "gene_id")] # reducing to start, end and gene_id
gencode_file_gff3[, length := V5 - (V4 - 1)] # calculating length
gencode_file_gff3 <- unique(gencode_file_gff3) # removing identical exons from HAVANA and ENSEMBL
sort(gencode_file_gff3[gene_id == "ENSG00000000003.15"]$length)

The result is

[1]   75   84   99  108  135  189  199  312 2878 2879

(10 exons). Without the unique command I retain 15 exons. So both is different from the Package. I can't figure out where this additional exon comes from. The unique command is necessary if two exons from HAVANA and ENSEMBL are exactly identical. I am not able to reproduce these results manually.

This is just exemplary, but I get slightly different numbers for most genes.

R GenomicFeatures • 1.3k views
ADD COMMENT
1
Entering edit mode

In case anyone is interested, I also asked on Github an their answer was: https://github.com/Bioconductor/GenomicFeatures/issues/60 Short: GenomicFeatures uses Ensembl exon ids for filtering (ENSE number). Some Exons are identical based on their genomic location, but not on their ENSE number. So if we take all exons using GenomicFeatures, we get more then with my code. If we would only take the non overlapping parts (i.e. the unique combined genomic regions), this effect should be gone.

ADD REPLY
0
Entering edit mode
17 months ago
gernophil ▴ 90

I made a script, that completely resembles the results from GenomicFeatures:

library(data.table)

gencode_file_gff3 <- fread("gzcat tables/gencode.v43.basic.annotation.gff3.gz | grep -v '^#'")
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"]

gencode_file_gff3_id <- copy(gencode_file_gff3)
gencode_file_gff3_symbol <- copy(gencode_file_gff3)

# per gene_id
gencode_file_gff3_id[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)]
gencode_file_gff3_id[, gene_id := ifelse((V1 == "chrY") &
                                           is.element(gene_id,
                                                      gencode_file_gff3_id[V1 != "chrY"]$gene_id),
                                         paste0(gene_id, "_PAR_Y"), gene_id)]
gencode_file_gff3_id[, bases := Map(seq, V4, V5)]
gencode_file_gff3_id[, count_unique_bases := uniqueN(unlist(bases)), by = gene_id] # alternatively use: gencode_file_gff3_id[, count_unique_bases := length(unique(unlist(bases))), by = gene_id]
gencode_file_gff3_id <- gencode_file_gff3_id[, c("gene_id", "count_unique_bases")]
gencode_file_gff3_id <- unique(gencode_file_gff3_id)


# per gene_symbol
gencode_file_gff3_symbol[, gene_symbol := gsub(".*gene_name=(.*);transcript_type.*", "\\1", V9)]
gencode_file_gff3_symbol[, gene_symbol := ifelse((V1 == "chrY") &
                                           is.element(gene_symbol,
                                                      gencode_file_gff3_symbol[V1 != "chrY"]$gene_symbol),
                                         paste0(gene_symbol, "_PAR_Y"), gene_symbol)]
gencode_file_gff3_symbol[, bases := Map(seq, V4, V5)]
gencode_file_gff3_symbol[, count_unique_bases := uniqueN(unlist(bases)), by = gene_symbol] # alternatively use: gencode_file_gff3_symbol[, count_unique_bases := length(unique(unlist(bases))), by = gene_symbol]
gencode_file_gff3_symbol <- gencode_file_gff3_symbol[, c("gene_symbol", "count_unique_bases")]
gencode_file_gff3_symbol <- unique(gencode_file_gff3_symbol)

if you use a different features then exon from the GENCODE file, you might need to edit the parts with gsub to get the correct info.

ADD COMMENT
0
Entering edit mode
12 months ago
alejandrogzi ▴ 140

Hi! in case anyone ends here looking to calculate non-overlapping exon lengths, I recently develop this: https://github.com/alejandrogzi/noel, an extremely fast non-overlapping exon length calculator written in Rust. Outperforms all the script/tools, check the tool post here: NOEL: An extremely fast Non-Overlapping Exon Length calculator written in Rust

ADD COMMENT

Login before adding your answer.

Traffic: 1989 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6