Hi everyone.
I am trying to extract all unique exons from a GTF file using GRanges. This is the file I'm working on: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/
So far, this is what I have done:
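# Load the packages that provide import(), reduce() and width()
library(rtracklayer)
library(GenomicRanges)
# Import
raw_gtf <- import("gencode.v46.annotation.gtf", format = "gtf")
# Extract only exons
exons <- subset(raw_gtf, type == "exon")
# Merge overlapping exons (reduce() merges each strand separately by default)
reduced_exome <- reduce(exons)
# Total size is 160,758,297 ??
sum(width(reduced_exome))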
I'd expect the total size to be around 30 Mb. Why is it so much larger?
Have you tried adding a filter for biotype = protein_coding?
Hi Pierre, thanks for your answer.
I’ve tried removing the mitochondrial DNA and filtering for "protein_coding" gene types, but the total size is still much larger than expected. The filtering should be correct, though…
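For reference, here is a minimal sketch of the filtering I mean, run on a small toy GRanges instead of the real file so it can be checked by hand. The coordinates are invented, the object names (toy_gtf, pc_exons, reduced, total_bp) are just for illustration, and I'm assuming the GENCODE biotype attribute ends up in a metadata column called gene_type and that the mitochondrial contig is named chrM:

```r
library(GenomicRanges)

# Toy stand-in for the imported GTF; all coordinates are made up.
toy_gtf <- GRanges(
  seqnames  = c("chr1", "chr1", "chr1", "chr1", "chrM"),
  ranges    = IRanges(start = c(100, 150, 400, 1000, 10),
                      end   = c(200, 250, 500, 1100, 60)),
  strand    = c("+", "+", "+", "-", "+"),
  type      = c("exon", "exon", "exon", "exon", "exon"),
  gene_type = c("protein_coding", "protein_coding", "lncRNA",
                "protein_coding", "protein_coding")
)

# Keep exons of protein-coding genes that are not on the mitochondrial contig.
keep <- toy_gtf$type == "exon" &
  toy_gtf$gene_type == "protein_coding" &
  as.character(seqnames(toy_gtf)) != "chrM"
pc_exons <- toy_gtf[keep]

# Merge overlapping exons (per strand) and sum the merged widths.
reduced  <- reduce(pc_exons)
total_bp <- sum(width(reduced))
total_bp
```

On the toy data the two overlapping plus-strand exons collapse into one range, the lncRNA and chrM exons are dropped, and the minus-strand exon stays separate, so the filter itself behaves as I'd expect.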