Extracting human exome from GTF file
1
0
Entering edit mode
12 weeks ago
Eugenio • 0

Hi everyone.

I am trying to extract all unique exons from a GTF file using GRanges. This is the file I'm working on: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/

So far, this is what I have done:

# Import
raw_gtf <- import("gencode.v46.annotation.gtf", format = "gtf")

# Extract only exons
exons <- subset(raw_gtf, type == "exon")

# Merge overlapping sequences
reduced_exome <- reduce(exons)

# Total size is 160,758,297 ??
sum(reduced_exome@ranges@width)

I'd expect the size to be around 30Mb. Why is it so much larger?

gtf human exons • 653 views
ADD COMMENT
1
Entering edit mode

add a filter for biotype=protein_coding ?

ADD REPLY
0
Entering edit mode

Hi Pierre, thanks for your answer.

I’ve tried removing the mitochondrial DNA and filtering for "protein_coding" gene types, but the sequence is still much longer than expected. The filtering should be correct, though…

# Extract only exons in standard chromosomes (excluded mitochondrial DNA)
standard_chromosomes <- paste0("chr", c(1:22, "X", "Y"))
raw_exons <- subset(raw_gtf, type == "exon" & gene_type == "protein_coding")
exons <- raw_exons[seqnames(raw_exons) %in% standard_chromosomes]

# Merge overlapping sequences
reduced_exome <- reduce(exons)

# The total length is 110,190,436
sum(reduced_exome@ranges@width)
ADD REPLY
2
Entering edit mode
12 weeks ago

I have been playing around with your data; it all comes down to which exons one selects.

For example you can filter for transcript_type== 'protein_coding' and you can filter for transcript_support_level=1

When I do that, my sum ends up being around 37M

ADD COMMENT
0
Entering edit mode

I see, thank you!

Although I still find it curious that by filtering this way

standard_chromosomes <- paste0("chr", c(1:22, "X", "Y"))
raw_exons <- subset(raw_gtf, type == "exon" & gene_type == "protein_coding" & transcript_type == "protein_coding" & transcript_support_level == 1)
exons <- raw_exons[seqnames(raw_exons) %in% standard_chromosomes]

I get this final size -> 63,337,002

reduced_exome <- reduce(exons)
sum(reduced_exome@ranges@width)
ADD REPLY
1
Entering edit mode

I am uncomfortable using some of these advanced libraries when I don't fully understand what is going on inside them. For example, I don't know how the reduce works in GRanges.

I used a different code since solving this problem does not need GRanges. First, I filtered each exon by its attributes to retain those of interest, then used the exon ID as the unique key to retain only one exon. Then, I summed up the length of these unique exons. I had the code written by an AI - but did not quite check it closely. I was just interested in where the discrepancy might originate - I suspect it must be some filtering - investigating each record's attributes I've learned about the transcript_type and transcript_support_level attributes and used those.

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6