Question

Differences in exon number from Biopython Entrez Package and ANNOVAR

0

Entering edit mode

6.5 years ago

gero.knittel ▴ 10

Hi everybody,

I have a strange problem, which for sure has something to do with the version of some reference that I'm using.

I'm counting the number of exons of the transcript NM_003752 in the following way:

from Bio import Entrez
Entrez.email = 'xxxxxx'

handle = Entrez.efetch(db='Nucleotide', id='NM_003752', retmode='xml')
record = Entrez.read(handle)
feature_count = len(record[0]['GBSeq_feature-table'])
exoncount = 0
for i in range(0, feature_count):
    if record[0]['GBSeq_feature-table'][i]['GBFeature_key'] == 'exon':
        exoncount += 1
print(exoncount)

Output is 21. Also, if I count manually on the NCBI website (https://www.ncbi.nlm.nih.gov/nuccore/557129041/), I count 21 exons.

However, if I run ANNOVAR using the following input file:

16  28391160    28391160    G   A

with the following command

annotate_variation.pl -out results -build hg19 annovar.avinput humandb/ --separate

I get this result:

line1   nonsynonymous SNV EIF3CL:NM_001317857:exon21:c.C2704T:p.R902C,EIF3C:NM_001286478:exon22:c.C2671T:p.R891C,EIF3CL:NM_001099661:exon21:c.C2704T:p.R902C,EIF3C:NM_001267574:exon21:c.C2701T:p.R901C,EIF3C:NM_001037808:exon22:c.C2701T:p.R901C,EIF3CL:NM_001317856:exon21:c.C2704T:p.R902C,EIF3C:NM_001199142:exon22:c.C2701T:p.R901C,EIF3C:NM_003752:exon22:c.C2701T:p.R901C,  16  28391160    28391160    G   A

So it says that for transcript NM_003752, the resulting mutation will be in exon 22, which it shouldn't even have.

What's the problem here?

Thanks very much for helping out!

Best, Gero

annovar entrez package biopython • 1.9k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 88k • written 6.5 years ago by gero.knittel ▴ 10

1

Entering edit mode

@gero: I removed your email address from the post.

ADD REPLY • link 6.5 years ago by GenoMax 148k

0

Entering edit mode

Thank you, that was not very considerate of me.

ADD REPLY • link 6.5 years ago by gero.knittel ▴ 10

0

Entering edit mode

gero.knittel : You should email ANNOVAR authors and let them know if this discrepancy.

ADD REPLY • link 6.5 years ago by GenoMax 148k

score 0 · Answer 1 · 2018-06-20

0

Entering edit mode

6.5 years ago

Kevin Blighe 88k

Hello and good evening,

ANNOVAR is correct. For that particular transcript isoform, NM_001286478, it does indeed have 22 exons. The 'split' occurs at exon 9 of the gene. In the following screenshot from the UCSC Genome Browser, this transcript is highlighted in blue (at left) and you can see that, in exon 9, a few bases are not transcribed in that and other transcripts:

[source: https://genome-euro.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtM...]

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Kevin Blighe : OP is asking about NM_003752:exon22:c.C2701T:p.R901C which is at the end of that long line.

It could be an error with ANNOVAR's annotation correct?

ADD REPLY • link 6.5 years ago by GenoMax 148k

0

Entering edit mode

Indeed, I was looking at the incorrect transcript. NM_001286478 (and a few other transcripts) have 22 exons, whereas NM_003752, in which Gero is interested, only has 21 exons in all transcripts that I have observed.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

I have just checked my own version of ANNOVAR (Version: $Date: 2015-06-17 21:43:51 -0700 (Wed, 17 Jun 2015)), and it gives a different result for that transcript. It says exon 21, correctly:

line1   nonsynonymous SNV   EIF3C:NM_001037808:exon21:c.C2704T:p.R902C,EIF3CL:NM_001099661:exon21:c.C2704T:p.R902C,EIF3C:NM_001199142:exon21:c.C2704T:p.R902C,EIF3C:NM_001267574:exon20:c.C2704T:p.R902C,EIF3C:NM_001286478:exon21:c.C2674T:p.R892C,EIF3C:NM_003752:exon21:c.C2704T:p.R902C,    16  28391160    28391160    G   A

Which version of ANNOVAR are you using?

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

I'm using the current version (Version: $Date: 2018-04-16 00:43:31 -0400 (Mon, 16 Apr 2018)).

ADD REPLY • link 6.5 years ago by gero.knittel ▴ 10

0

Entering edit mode

What is the source for the annotation by ANNOVAR? At ensembl I cannot find any transcript with 22 exons. They all have 21 or less.

I'v seen it at least once that the data from UCSC were wrong. This leads to a problem in a custom target resequencing kit we let design by Agilent. We gave them the transcript number we liked to capture and they design the probes. For one exon, in one gene the coordinates were wrong at the end, so we could not capture this one. Investigating on that problem it turned out that UCSC had the wrong coordinates for that exon.

fin swimmer

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

Entering edit mode

well, IGV (v 2.4.10) shows it's exon 22 for transcript (for both hg19 and b37) (screenshot attached). But UCSC table data for NM_003752 (for grch37/ hg19) lists only 21 exons. Probably, this discrepancy is due to incorrect transcript version in annotation, for that build.

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

Entering edit mode

Hm, I'm pretty sure this is some kind of version problem of the annotation source.

In my IGV (v 2.4.10) it's exon 21.

ADD REPLY • link 6.5 years ago by finswimmer 16k

0

Entering edit mode

I think it is the problem with annotation source. I deleted earlier hg19.genome from IGV (genomes folder) and downloaded new one. This time, the transcript has 21 exons.

ADD REPLY • link 6.5 years ago by cpad0112 21k

0

Entering edit mode

In the UCSC screenshot that I showed, some also have 22 exons (I believe that they were 'UCSC genes'). Gene annotation is a constantly evolving process, as far as I understand.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Ladies and Gentlemen: what we are witnessing here is the complexity of the genome at play... transcription is pervasive and constant, and performed in ways we don't fully grasp in our daily research activities.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k