snpEff: wrong HGVS output?
1
0
Entering edit mode
6.2 years ago
Marvin ▴ 220

When I go to UCSC table browser and get the BED file for NM_000314.6 (gene = "PTEN"), I find out that the first exon starts at position 89623194 in GRCh37/hg19 (!). Thus the exon starts at 89623194+1 = 89623195 in a 1-based coordinate system.

Now I've created an artificial VCF file:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER
10  89623195        C   A   .   PASS

1 SNP at position 89623195, the first nucleotide of the first exon. Then I use the following snpEff command to annotate that SNP:

java -Xmx4g -jar ~/snpEff_latest_core/snpEff/snpEff.jar GRCh37.p13.RefSeq artificial.vcf

The ANN record that I get for NM_000314 is the following:

A|5_prime_UTR_variant|MODIFIER|PTEN|PTEN|transcript|NM_000314.4|protein_coding|1/9|c.-1032C>A|||||1032|

I want to focus on the HGVS nomenclature of that record: c.-1032C>A. If you navigate to the NCBI page of NM_000314 you will see that the coding sequence starts at position 1032 of the transcript. If I had introduced a SNP at position 1031, the expected HGVS nomenclature would be c.-1C>A. If you follow that logic, my introduced SNP at the first nucleotide of exon1 must have the HGVS nomenclature c.-1031C>A. Which is not the case. Is there an error in the snpEff algorithm?

snpeff hgvs • 2.9k views
ADD COMMENT
0
Entering edit mode

mutalyzer output for chr10:g.89623195C>A (NC_000010.11:g.89623195C>A) and GRCh38:

PANK1     : NM_138316.3:c.29-11147G>T
        NM_148978.2:c.29-11147G>T
        NM_148977.2:c.704-11147G>T
        XM_005269902.1:c.704-11147G>T
        XM_005269902.2:c.704-11147G>T

with GRCh37 (hg19):

PTEN      : LRG_311t1:c.-1032C>A
        NM_000314.4:c.-1032C>A
KLLN      : NM_001126049.1:c.-951G>T
LOC1019297: XR_246154.1:n.-1111C>A

I used UCSC hgtables function to export coordinates of NM_000314 using hg19 build, UCSC gene track and known genes track. As you see, CDS start is 89624226.

#name   chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts
uc001kfb.3  chr10   +   89623194    89728532    89624226    89725229    9   89623194,89653781,89685269,89690802,89692769,89711874,89717609,89720650,89725043,
ADD REPLY
0
Entering edit mode

yes, everything as expected here, c.-1032 is the correct HGVS notation (see the comments below Kevins answer)

ADD REPLY
0
Entering edit mode

Hi Marvin,

There is no need to delete your question, especially when people have provided helpful reactions.

Cheers,
Wouter

ADD REPLY
2
Entering edit mode
6.2 years ago

I don't think so. Your variant is in the very first position of the 5' UTR of that PTEN transcript isoform. This 5' UTR is 1032 bases in length. HGVS names variants in the 5' UTR with respect to the first base of the first exon. Your variant is 1032 bases from the first exon, so, it is named c.-1032C>A

You should simulate variants at various positions in and around the 1 exon to see how snpEff names those.

Kevin

ADD COMMENT
0
Entering edit mode

Hi Kevin, how do you retrieve the information that the 5' UTR is 1032 bases in length? If you navigate to the NM_000314 page at NCBI you will find that the coding sequence starts at Position 1032 (look for "CDS" Feature). Thus the 5' UTR is 1031 bases in length, not 1032.

ADD REPLY
0
Entering edit mode

I think I've found the solution: the annotation on the NCBI page of NM_000314 is just wrong! They say that the CDS Feature starts at position 1032 of the transcript. However, in the UCSC / ensembl / NCBI genome browser, the 5' UTR starts at position 89,623,195 and the coding sequence starts at Position 89,624,227. This means, that the coding sequence starts at the 89,624,227 - 89,623,194 = 1033rd position of the transcript. Not at the 1032nd as indicated by NCBI. Correct?

ADD REPLY
1
Entering edit mode

Again, I am not so sure. Firstly, the co-ordinates in the UCSC genome browser are 'corrected' to 1-base co-ordinates so as to be in line with other databases. Behind the scenes in the UCSC servers, the co-ordinates are stored as 0-based.

Your particular transcript, NM_000314, has an exon 1 that happens to have a relatively long 5' UTR of 1032 bases before the first CDS sequence of this exon is reached. Your variant is in the first base of this first exon's 5' UTR, and thus, it is annotated c.-1032C>A.

Some of the exon 1s of other PTEN transcript isoforms have much shorter 5' UTRs.

I found the length of the UTR for your transcript isoform by checking the details of the equivalent UCSC transcript isoform: uc001kfb.3

If you feel that there is an error, you may very well get int ouch with snpEff or the HGVS itself

ADD REPLY
0
Entering edit mode

You correctly state that the 5' UTR is 1032 bases in length. This means that the coding sequence MUST start at 1033, right? But according to NCBI (https://www.ncbi.nlm.nih.gov/nuccore/NM_000314) the coding sequence ("Feature" -> "CDS") starts at position 1032. This doesn't match your observation. And by now I think your observation ("5' UTR = 1032") is correct and NCBI is not correct. How can you be "not so sure". Either you're right, or they're right, right? :-)

ADD REPLY
0
Entering edit mode

In Bioinformatics, Sir, nobody can ever be 100% certain because there are so many bugs, and also due to the fact that most bioinformaticians are not trained to properly test their functions/code.

ADD REPLY
1
Entering edit mode

I think this is a case of "either A or B". I was hoping for a confirmation but it's okay since I'm fairly confident that the NCBI made a mistake. Which is possible since their employees are humans, too :-) Thanks for your help!

ADD REPLY

Login before adding your answer.

Traffic: 1566 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6