Question

Why does a GFF prepared with dexseq_prepare_annotation show a different exon count than the UCSC Genome Browser?

7.5 years ago

komal.rathi ★ 4.1k

Disclaimer: I tried to post this on the Bioconductor support site, but it won't let me. I even tried adding an entire paragraph in "English language", but it still wouldn't accept the post.

Hi everyone,

I am using DEXSeq for exon-level quantification. I ran dexseq_prepare_annotation.py to convert the GENCODE v23 GTF to a GFF like this:

python2.7 ~/path/to/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py gencode.v23.annotation.gtf gencode.v23.annotation.gff

For IDO2 (gene ID ENSG00000188676), I got 18 exonic parts in the GFF:

grep 'ENSG00000188676' gencode.v23.annotation.gff

chr8    dexseq_prepare_annotation.py    aggregate_gene    39934614    40016391    .    +    .    gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39934614    39934954    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "001"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39934955    39935218    .    +    .    transcripts "ENST00000343295.8+ENST00000502986.2"; exonic_part_number "002"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39949149    39949165    .    +    .    transcripts "ENST00000343295.8+ENST00000502986.2"; exonic_part_number "003"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39949166    39949264    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "004"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39963608    39963703    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "005"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39979067    39979186    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "006"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39982652    39982770    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "007"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39984951    39985507    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "008"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39985508    39985522    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "009"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39985523    39986460    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "010"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39986900    39987085    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "011"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39987743    39987870    .    +    .    transcripts "ENST00000418094.1"; exonic_part_number "012"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39987871    39987970    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "013"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39989721    39989838    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "014"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40005327    40005378    .    +    .    transcripts "ENST00000389060.8+ENST00000502986.2"; exonic_part_number "015"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40013565    40013713    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "016"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40015247    40015605    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "017"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40015606    40016391    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000502986.2"; exonic_part_number "018"; gene_id "ENSG00000188676.13"

However, the UCSC Genome Browser shows IDO2 with 10 exons. Why does my GFF contain 18 exonic parts? Is this expected, or is there an issue with the conversion?

dexseq_prepare_annotation dexseq exons • 2.8k views

updated 7.5 years ago by Devon Ryan 105k • written 7.5 years ago by komal.rathi ★ 4.1k

score 6 · Accepted Answer · 2017-09-26

An "exonic part" is simply a part of an exon, so there will be at least as many of them as there are exons. Take the following example of a gene with two isoforms:

####--------####----
####------####----##

Here, each # is an exonic base and each - is an intronic or intergenic base. I'll merge all of those exons together and then illustrate where the exonic parts are:

####------######--##
1111------223344--55  (1-5 indicate that the base above belongs to that `exonic part`)

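You can see that the merged exons are divided into disjoint sections, where each "part" is shared completely by all of the transcripts that contain it (compare exon 2 in the two isoforms, which is only partially shared between them). So 18 exonic parts for a gene with several overlapping isoforms is expected and does not indicate a problem with the conversion.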
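If it helps, here is a minimal Python sketch of that splitting applied to the toy example above. It only illustrates the idea and is not the code in dexseq_prepare_annotation.py; the transcript names tx1 and tx2 and the function exonic_parts are made up for this example. A new part starts wherever the set of transcripts covering a base changes.

```python
from itertools import groupby

# Toy isoforms copied from the diagram: '#' = exonic base, '-' = non-exonic.
# The names tx1/tx2 are hypothetical labels for the two isoforms.
isoforms = {
    "tx1": "####--------####----",
    "tx2": "####------####----##",
}


def exonic_parts(isoforms):
    """Return one (start, end, transcripts) tuple per exonic part.

    Coordinates are 0-based and half-open, i.e. [start, end).
    """
    length = len(next(iter(isoforms.values())))
    # For every base, record which transcripts have an exon there.
    cover = [
        frozenset(name for name, track in isoforms.items() if track[pos] == "#")
        for pos in range(length)
    ]
    parts = []
    pos = 0
    # Consecutive bases covered by the same non-empty set of transcripts
    # form one exonic part; empty sets are introns and are skipped.
    for tx_set, run in groupby(cover):
        run_len = len(list(run))
        if tx_set:
            parts.append((pos, pos + run_len, sorted(tx_set)))
        pos += run_len
    return parts


parts = exonic_parts(isoforms)
for number, (start, end, txs) in enumerate(parts, start=1):
    print(f"part {number}: [{start}, {end}) {'+'.join(txs)}")
```

This reports five parts, matching the labels 1-5 in the diagram, and the middle exon is cut into three pieces (tx2 only, both, tx1 only). The same thing happens in your GFF wherever consecutive exonic parts abut, such as parts 008-010.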