Question

Length(Ucsc/Ensgene) % 3 != 0

1

14.9 years ago

Pierre Lindenbaum 166k

Before I ask the question on the UCSC mailing list: is it me, or is it something else?

I've noticed that some (not all) protein-coding records in the hg18 UCSC ensGene.txt have a coding length that is not a multiple of 3 (length % 3 != 0).

For example, for http://genome.ucsc.edu/cgi-bin/hgc?g=htcGeneMrna&i=ENST00000383614&c=chr6&l=30082338&r=30083996&o=ensGene&table=ensGene
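One way to screen a whole ensGene.txt for such records is to sum the part of each exon that falls inside the CDS. The sketch below is only an illustration: it assumes the usual UCSC genePred layout with a leading `bin` column (name in column 2, cdsStart/cdsEnd in columns 7 and 8, exonStarts/exonEnds in columns 10 and 11) and 0-based, half-open coordinates. The example row is made up and is not a real transcript.

```python
def cds_length(row):
    """Return (transcript_name, CDS length) for one tab-separated genePred row.

    Assumed column layout (an assumption, check your table's schema):
    bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd,
    exonCount, exonStarts, exonEnds, ...
    """
    fields = row.rstrip("\n").split("\t")
    name = fields[1]
    cds_start, cds_end = int(fields[6]), int(fields[7])
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]

    total = 0
    for start, end in zip(exon_starts, exon_ends):
        # Coordinates are 0-based half-open, so a length is simply end - start.
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return name, total


# Hypothetical row: two exons, CDS from 150 to 350.
example_row = "\t".join(
    ["585", "ENST_HYPOTHETICAL", "chr1", "+", "100", "400",
     "150", "350", "2", "100,300,", "200,400,"]
)

name, n = cds_length(example_row)
if n > 0 and n % 3 != 0:
    print(f"{name}: CDS length {n} is not a multiple of 3")
```

Rows with cdsStart equal to cdsEnd (non-coding) give a length of 0 and are skipped by the final check.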

>ENST00000383614
ccccagacgccgacgatggggtcATGGCGCCCCGAACCCTCCTCCTGCTG
CTCTCGGGGACCCTGGCCCTGGCCGAGACCTGGGCGGCCCCCCCCAAGAC
ACACGTGACCCacccccctctctgaacatgaggcataa

echo -n ATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGGACCCTGGCCCTGGCCGAGACCTGGGCGGCCCCCCCCAAGACACACGTGACCC | wc -c
88

but 88 % 3 != 0 (the remainder is 1)

Is this an error on the UCSC side, or am I missing something?
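The same check can be sketched in Python directly on the FASTA body above. It assumes, as the `echo | wc -c` command does, that the upper-case bases are the coding part and the lower-case bases are UTR; the function name `coding_part` is just a helper for this illustration.

```python
fasta_body = """\
ccccagacgccgacgatggggtcATGGCGCCCCGAACCCTCCTCCTGCTG
CTCTCGGGGACCCTGGCCCTGGCCGAGACCTGGGCGGCCCCCCCCAAGAC
ACACGTGACCCacccccctctctgaacatgaggcataa
"""


def coding_part(seq_text):
    """Keep only upper-case bases (assumed CDS); drop newlines and lower-case UTR."""
    return "".join(ch for ch in seq_text if ch in "ACGTN")


cds = coding_part(fasta_body)
length = len(cds)
remainder = length % 3

# Split into codons to see which bases are left over at the 3' end.
codons = [cds[i:i + 3] for i in range(0, length - remainder, 3)]
leftover = cds[length - remainder:] if remainder else ""

print(length, remainder, codons[-1], leftover)
```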

protein ucsc cdna translation sequence • 2.8k views

updated 3.6 years ago by Ram 45k • written 14.9 years ago by Pierre Lindenbaum 166k

0

The sequence you posted has a stop codon at position 87 of the nucleotide sequence (84 if you start counting from 0).

14.9 years ago by Giovanni M Dall'Olio 28k

0

Could be an error from the UCSC: https://lists.soe.ucsc.edu/pipermail/genome/2010-May/022368.html

updated 5.6 years ago by Ram 45k • written 14.9 years ago by Pierre Lindenbaum 166k

0

Yes, that was an error: "We have determined that the data as originally incorporated into the track was strangely annotated and Ensembl has since corrected the error. The track on our side will be updated (and this data corrected) at the next update."

14.9 years ago by Pierre Lindenbaum 166k

Ram · Answer 1 · 2010-05-25

2

14.9 years ago

Giovanni M Dall'Olio 28k

It may be an error in the annotation: there are many, I can assure you. A while ago, the Ensembl maintainers removed a gene that I was studying when they merged its transcript with another gene.

Note that the sequence you posted has a stop codon at position 87 of the nucleotide sequence (84 if you start counting from 0).

By the way, the sequence you posted belongs to an MHC chain, a gene that is well known for its variability and for generating a lot of transcripts.
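Whether a stop triplet actually ends translation depends on its frame relative to the ATG, so it can help to list every TAA/TAG/TGA together with its frame. The sketch below does that for the 88-base string from the question; `find_stop_triplets` is a helper written for this illustration, and positions are reported both 0-based and 1-based.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_stop_triplets(seq):
    """Return (zero_based_start, frame, triplet) for every stop triplet in seq.

    frame is zero_based_start % 3, so frame 0 means in frame with the
    first base of seq (here, the A of the ATG).
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        if triplet in STOP_CODONS:
            hits.append((i, i % 3, triplet))
    return hits


# Upper-case part of the sequence posted in the question.
cds = ("ATGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGGGACCCTGGCCCTGGCC"
       "GAGACCTGGGCGGCCCCCCCCAAGACACACGTGACCC")

hits = find_stop_triplets(cds)
in_frame = [h for h in hits if h[1] == 0]

for start, frame, triplet in hits:
    print(f"{triplet} at 0-based {start} (1-based {start + 1}), frame {frame}")
print("in-frame stops:", in_frame)
```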

updated 5.6 years ago by Ram 45k • written 14.9 years ago by Giovanni M Dall'Olio 28k

0

I agree that this is a very variable region with several transcripts and a pseudogene.

14.9 years ago by Neilfws 49k

Ram · Answer 2 · 2010-05-25

2

14.9 years ago

Neilfws 49k

The same transcript at ensembl.org has length = 87 bp and a slightly different 3' sequence. I wonder if this is related to UCSC sequences having zero-based starts (i.e. first base = 0)?
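To illustrate how a coordinate convention alone could turn 87 into 88, here is a small sketch. UCSC tables store 0-based, half-open intervals; reading such a start as if it were 1-based and closed adds one base. The coordinates are hypothetical, and this only shows the arithmetic, not what actually happened to this transcript.

```python
def length_half_open(start, end):
    """Length of a 0-based, half-open interval [start, end), as in UCSC tables."""
    return end - start


def length_closed(start, end):
    """Length of a 1-based, fully closed interval [start, end]."""
    return end - start + 1


def to_one_based(start0, end):
    """Convert 0-based half-open (start0, end) to 1-based closed (start1, end)."""
    return start0 + 1, end


# Hypothetical coordinates, not taken from ensGene.txt.
start0, end = 1000, 1087

correct = length_half_open(start0, end)                 # interval read as intended
misread = length_closed(start0, end)                    # start wrongly treated as 1-based
converted = length_closed(*to_one_based(start0, end))   # proper conversion first

print(correct, misread, converted)
```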

updated 3.6 years ago by Ram 45k • written 14.9 years ago by Neilfws 49k

0

I forgot to add that this comes from the latest Ensembl release, whereas your data are from hg18.

14.9 years ago by Neilfws 49k