Question

The exon numbers of each transcript

4.0 years ago

haasroni • 0

I am looking for information about the exons that compose each transcript of a specific gene. For example: transcript 1, exons 1, 2, 5; transcript 2, exons 1, 2, 3, 4, 5, 6, 7; etc.

I tried to use a gtf table for that. In the table I found:

The exon_id column, which contains far more IDs than the number of exons in this gene (about 200), so I guess it gives more specific information about variation within exons.

The exon_number column. But it seems that the exons are numbered relative to each transcript, not relative to the gene (for example, for transcript 1 with exons 1, 2, 5, the exon numbers in the table would be 1, 2, 3).

Is there a way to find the exon numbers in each transcript, relative to the gene?

Thank you!

RNA exon • 4.4k views

biomaRt should return exons with their exon rank with respect to the transcript.

ADD REPLY • link 4.0 years ago by swbarnes2 15k

Thank you. Does "with respect to the transcript" mean that the exon numbering will be similar to what I found in the gtf table?

ADD REPLY • link 4.0 years ago by haasroni • 0

score 0 · Answer 1 · 2021-07-16

4.0 years ago

Emily 24k

You can't get exon numbers relative to the gene, because exons are not completely discrete entities. Often two transcripts of a gene will have similar exons that overlap but have different starts or ends. How would you number those?

Thanks, Emily. Of course, I agree. What I meant was to treat exons that overlap but have different starts or ends as the same exon for this purpose.

ADD REPLY • link 4.0 years ago by haasroni • 0

score 0 · Answer 2 · 2021-07-16

Emily is right, of course. But that doesn't mean you can't try anyway!

My conservative estimate is the all.uxons object; this will likely underestimate the number of exons per gene. If the gene has retained introns annotated as exons, it may erroneously collapse multiple exons into one. In general, you should get the same number of exons as are known to exist for a gene, or fewer.

My other estimate is the all.exons.unique object; this will likely overestimate the number of exons per gene. Some genes are extensively annotated with exons that may not exist or may simply exist only in non-standard biological conditions.

library(GenomicFeatures)

## using the Ensembl 101 release of the human transcriptome
txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.101.gtf')

## get all exons
all.exons <- exonsBy(txdb,'gene')

## keep only unique exons per gene
all.exons.unique <- unique(all.exons)

## flatten all exons per gene
all.uxons <- reduce(all.exons)

## this is how the output looks
head(elementNROWS(all.uxons))
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
             10               7              10              15              32 

## number of genes where this is probably the accurate number of exons
sum(elementNROWS(all.uxons) == elementNROWS(all.exons.unique))
[1] 37091

## number of genes where there are still discrepancies
sum(elementNROWS(all.uxons) != elementNROWS(all.exons.unique))
[1] 23580

EDIT: I see you said by transcript... a bit of a dubious question in and of itself. You could probably use the all.uxons object above and go through and renumber every known exon against it, but you would collapse a lot of them. You could also try looking at exonsBy(txdb,'tx').
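
For what it's worth, here is a rough sketch of that renumbering idea on a toy gene, so it runs without the GTF. The coordinates and the names ex.by.tx, uxons and gene_exon_number are made up for illustration; with real data, ex.by.tx would be something like the output of exonsBy(txdb,'tx') restricted to the transcripts of one gene. It treats overlapping exons as "the same exon", with the caveat already mentioned that a retained intron would merge its flanking exons into one number.

```r
library(GenomicRanges)

## Toy gene (hypothetical coordinates) with two transcripts on the + strand.
## The second exon of tx2 overlaps the second exon of tx1 but starts earlier.
ex.by.tx <- GRangesList(
  tx1 = GRanges("chr1",
                IRanges(start = c(100, 300, 900), end = c(150, 400, 950)),
                strand = "+"),
  tx2 = GRanges("chr1",
                IRanges(start = c(100, 280, 500, 900), end = c(150, 400, 600, 950)),
                strand = "+")
)

## Flatten all exons of the gene into non-overlapping "uxons"
uxons <- reduce(unlist(ex.by.tx, use.names = FALSE))

## Number the uxons 5' to 3'; reduce() sorts by start, so flip for the - strand
gene.strand <- as.character(strand(uxons))[1]
uxons$gene_exon_number <- if (gene.strand == "-") {
  rev(seq_along(uxons))
} else {
  seq_along(uxons)
}

## For each transcript, look up which uxon each of its exons falls in
gene.exon.numbers <- lapply(ex.by.tx, function(tx.exons) {
  hits <- findOverlaps(tx.exons, uxons)
  uxons$gene_exon_number[subjectHits(hits)]
})

gene.exon.numbers
## tx1: 1 2 4
## tx2: 1 2 3 4
```

Because the uxons are built from the very exons being looked up, every exon lands in exactly one uxon, so each transcript gets one gene-level number per exon.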