Question

Extracting human exome from GTF file

0

Entering edit mode

4 months ago

Eugenio • 0

Hi everyone.

I am trying to extract all unique exons from a GTF file using GRanges. This is the file I'm working on: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/

So far, this is what I have done:

# Import
raw_gtf <- import("gencode.v46.annotation.gtf", format = "gtf")

# Extract only exons
exons <- subset(raw_gtf, type == "exon")

# Merge overlapping sequences
reduced_exome <- reduce(exons)

# Total size is 160,758,297 ??
sum(reduced_exome@ranges@width)

I'd expect the size to be around 30Mb. Why is it so much larger?

gtf human exons • 738 views

ADD COMMENT • link updated 4 months ago by Istvan Albert 102k • written 4 months ago by Eugenio • 0

1

Entering edit mode

add a filter for biotype=protein_coding ?

ADD REPLY • link 4 months ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hi Pierre, thanks for your answer.

I’ve tried removing the mitochondrial DNA and filtering for "protein_coding" gene types, but the sequence is still much longer than expected. The filtering should be correct, though…

# Extract only exons in standard chromosomes (excluded mitochondrial DNA)
standard_chromosomes <- paste0("chr", c(1:22, "X", "Y"))
raw_exons <- subset(raw_gtf, type == "exon" & gene_type == "protein_coding")
exons <- raw_exons[seqnames(raw_exons) %in% standard_chromosomes]

# Merge overlapping sequences
reduced_exome <- reduce(exons)

# The total length is 110,190,436
sum(reduced_exome@ranges@width)

ADD REPLY • link updated 4 months ago by Ram 44k • written 4 months ago by Eugenio • 0

score 2 · Answer 1 · 2024-08-28

2

Entering edit mode

4 months ago

Istvan Albert 102k

I have been playing around with your data; it all comes down to which exons one selects.

For example you can filter for transcript_type== 'protein_coding' and you can filter for transcript_support_level=1

When I do that, my sum ends up being around 37M

ADD COMMENT • link 4 months ago by Istvan Albert 102k

0

Entering edit mode

I see, thank you!

Although I still find it curious that by filtering this way

standard_chromosomes <- paste0("chr", c(1:22, "X", "Y"))
raw_exons <- subset(raw_gtf, type == "exon" & gene_type == "protein_coding" & transcript_type == "protein_coding" & transcript_support_level == 1)
exons <- raw_exons[seqnames(raw_exons) %in% standard_chromosomes]

I get this final size -> 63,337,002

reduced_exome <- reduce(exons)
sum(reduced_exome@ranges@width)

ADD REPLY • link 4 months ago by Eugenio • 0

1

Entering edit mode

I am uncomfortable using some of these advanced libraries when I don't fully understand what is going on inside them. For example, I don't know how the reduce works in GRanges.

I used a different code since solving this problem does not need GRanges. First, I filtered each exon by its attributes to retain those of interest, then used the exon ID as the unique key to retain only one exon. Then, I summed up the length of these unique exons. I had the code written by an AI - but did not quite check it closely. I was just interested in where the discrepancy might originate - I suspect it must be some filtering - investigating each record's attributes I've learned about the transcript_type and transcript_support_level attributes and used those.

ADD REPLY • link 4 months ago by Istvan Albert 102k