Why does a GFF prepared using dexseq_prepare_annotation show a different exon number than the UCSC Genome Browser?
7.2 years ago
komal.rathi ★ 4.1k

Disclaimer: I tried to post this on the Bioconductor support site, but it won't allow me. I tried adding an entire paragraph in "English language", but it still wouldn't allow me.

Hi everyone,

I am using DEXSeq for exon quantification. I ran dexseq_prepare_annotation to convert gencode v24 GTF to GFF like this:

python2.7 ~/path/to/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py gencode.v23.annotation.gtf gencode.v23.annotation.gff

For IDO2 which has gene id ENSG00000188676, I got 18 exonic parts in the GFF:

grep 'ENSG00000188676' gencode.v23.annotation.gff

chr8    dexseq_prepare_annotation.py    aggregate_gene    39934614    40016391    .    +    .    gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39934614    39934954    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "001"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39934955    39935218    .    +    .    transcripts "ENST00000343295.8+ENST00000502986.2"; exonic_part_number "002"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39949149    39949165    .    +    .    transcripts "ENST00000343295.8+ENST00000502986.2"; exonic_part_number "003"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39949166    39949264    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "004"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39963608    39963703    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "005"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39979067    39979186    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "006"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39982652    39982770    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "007"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39984951    39985507    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "008"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39985508    39985522    .    +    .    transcripts "ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "009"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39985523    39986460    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "010"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39986900    39987085    .    +    .    transcripts "ENST00000343295.8"; exonic_part_number "011"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39987743    39987870    .    +    .    transcripts "ENST00000418094.1"; exonic_part_number "012"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39987871    39987970    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "013"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    39989721    39989838    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "014"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40005327    40005378    .    +    .    transcripts "ENST00000389060.8+ENST00000502986.2"; exonic_part_number "015"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40013565    40013713    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "016"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40015247    40015605    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000389060.8+ENST00000502986.2"; exonic_part_number "017"; gene_id "ENSG00000188676.13"
chr8    dexseq_prepare_annotation.py    exonic_part    40015606    40016391    .    +    .    transcripts "ENST00000418094.1+ENST00000343295.8+ENST00000502986.2"; exonic_part_number "018"; gene_id "ENSG00000188676.13"

However, the UCSC Genome Browser shows that IDO2 has 10 exons. Why does my GFF show 18 exonic parts? Is there an issue with the conversion?

7.2 years ago

An "exonic part" is simply a part of an exon, so there will be at least as many of them as there are exons. Take the following example of a gene with two isoforms:

####--------####----
####------####----##

Here, # is an exonic base and - is an intronic or intergenic base. I'll merge all of those exons together and then illustrate where the exonic parts are:

####------######--##
1111------223344--55  (1-5 indicate that the base above belongs to that `exonic part`)

You can see that the exons are divided into disjoint sections, where each "part" is shared completely by all of the transcripts that contain it (compare exon 2 in the two isoforms, which is only partially shared between them).
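
The splitting described above can be sketched in a few lines of Python. This is only an illustration on the toy gene, not the code that dexseq_prepare_annotation.py actually runs; the function names `exonic_parts` and `exons_per_transcript`, the isoform names, and the 1-based inclusive coordinates are all assumptions made for the example. The second function merges adjacent parts back together per transcript, which shows why the number of exonic parts can exceed the exon count of any single transcript.

```python
def exonic_parts(transcripts):
    """Split exons into disjoint parts.

    transcripts: dict mapping a transcript name to a list of
    (start, end) exon intervals, 1-based and inclusive (assumed here).
    Returns a list of (start, end, [transcript names]) sorted by start.
    """
    # Every exon start, and the base after every exon end, is a position
    # where the set of transcripts covering the genome can change.
    boundaries = set()
    for exons in transcripts.values():
        for start, end in exons:
            boundaries.add(start)
            boundaries.add(end + 1)
    cuts = sorted(boundaries)

    parts = []
    for left, right in zip(cuts, cuts[1:]):
        # Between two consecutive cuts, each exon covers the whole
        # segment or none of it.
        covering = sorted(
            name
            for name, exons in transcripts.items()
            if any(s <= left and right - 1 <= e for s, e in exons)
        )
        if covering:  # skip intronic/intergenic segments
            parts.append((left, right - 1, covering))
    return parts


def exons_per_transcript(parts):
    """Count exons per transcript by merging directly adjacent parts."""
    counts = {}
    last_end = {}
    for start, end, names in parts:  # parts are sorted by start
        for name in names:
            # A part that does not directly follow the transcript's
            # previous part begins a new exon.
            if last_end.get(name) != start - 1:
                counts[name] = counts.get(name, 0) + 1
            last_end[name] = end
    return counts


# The two toy isoforms from the diagram above (hypothetical names).
toy = {
    "isoform_1": [(1, 4), (13, 16)],
    "isoform_2": [(1, 4), (11, 14), (19, 20)],
}
parts = exonic_parts(toy)
counts = exons_per_transcript(parts)
print(len(parts))  # 5 exonic parts
print(counts)      # {'isoform_1': 2, 'isoform_2': 3}
```

The toy gene yields five exonic parts even though its isoforms have only two and three exons. The same effect would explain a GFF with 18 exonic parts for a gene whose individual transcripts each have fewer exons.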


Thank you very much for the explanation!!
