Some CDS have a length that is not a multiple of 3. How can this be? Example: the CDS for gene ENSMUSG00000076764 begins with ATG and ends in TGA, as it should, but has length = 338, which is not divisible by 3. What's going on here?
Many CDS begin with one or two "N" nucleotides. How can this be? When I look up the given coordinates in the UCSC genome browser, there is no "N" anywhere near these positions. Example: the CDS for gene ENSMUSG00000043241 begins with two N's. Its length is divisible by 3, so I assume the N's "belong" there. But why are they N? What does it mean?
1. It's OK for a CDS length not to be a multiple of 3, because we are sometimes unable to annotate the codons fully. Depending on the biological evidence we have, we may be able to annotate only 1-2 bases rather than the 3 bases that make up a codon. These 1-2 bases are then left hanging at the end (or beginning) of the sequence (see ENST00000589877). Also, not all CDSs in Ensembl begin at ATG and end at a stop codon (more on this thread). Your gene ENSMUSG00000076764 (TcR) is a tricky one. Ig and TcR genes are unusual: each full-length gene is made up from a collection of gene segments linked together by recombination signal sequences. The HAVANA team has annotated the human counterparts of Ig/TcR genes in the past, but not many of the mouse genes. This mouse gene was manually annotated by HAVANA today and its CDS will be this
Note that the stop codon has not been found, so the 3' end will be left open (represented by the X in the above sequence).
This new CDS will be available in our next release of Ensembl (due next week) and the GTF file should be corrected then.
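The frame arithmetic in point 1 can be checked with a few lines of Python. This is only a minimal sketch, not Ensembl code: `describe_cds` is a hypothetical helper name, and the toy sequence is made up for illustration (it is not the real CDS of ENSMUSG00000076764). It shows that a sequence can start with ATG and end in TGA while the final TGA is not an in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def describe_cds(seq):
    """Summarise basic frame properties of a CDS given as a plain string."""
    seq = seq.upper()
    remainder = len(seq) % 3
    ends_with_stop_triplet = seq[-3:] in STOP_CODONS
    return {
        "length": len(seq),
        # Number of bases left over after the last complete codon.
        "remainder": remainder,
        "starts_with_atg": seq.startswith("ATG"),
        # True if the last three bases spell a stop codon, in frame or not.
        "ends_with_stop_triplet": ends_with_stop_triplet,
        # The final triplet is an in-frame stop (counting from the first
        # base) only if the length is a multiple of 3.
        "in_frame_terminal_stop": remainder == 0 and ends_with_stop_triplet,
    }


# Toy 11-base sequence: starts with ATG, ends in TGA, but 11 % 3 == 2.
toy = describe_cds("ATGGCCAATGA")
print(toy)

# The length reported in the question leaves the same remainder.
print(338 % 3)  # 2
```

A non-zero `remainder` on its own does not say which end of the CDS is incomplete; that information comes from the annotation, not from the sequence.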
2. You are right: there are no Ns in the reference genome in the region immediately upstream of the start of ENSMUST00000128200 (one of the transcripts of ENSMUSG00000043241). Instead we have AAT (see them in the Ensembl Location view), so these Ns have nothing to do with the genomic sequence itself. The Ns are there because the piece of evidence we used to annotate ENSMUST00000128200, i.e. BC039278.1, does not extend further upstream than the sequence of the Ensembl transcript, and we use Ns to represent that. If we had put the genomic bases there instead, we would have been predicting that sequence from the genome, and in Ensembl we don't predict, we annotate models. We need biological evidence (mRNA, EST, protein, RNA-Seq reads) to annotate our gene models.
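To find the CDSs affected by point 2 in a set of sequences, one can count the N's at the start of each CDS. The snippet below is a rough sketch under the assumption that the CDS is already available as a plain string; `leading_n_count` and `first_codon` are hypothetical helper names, and the example sequence is invented rather than taken from ENSMUST00000128200.

```python
def leading_n_count(seq):
    """Return how many N characters the sequence starts with."""
    seq = seq.upper()
    return len(seq) - len(seq.lstrip("N"))


def first_codon(seq):
    """Return the first three characters of the sequence, upper-cased."""
    return seq.upper()[:3]


# Invented example: two placeholder N's followed by supported bases.
cds = "NNGATGGCC"
print(leading_n_count(cds))  # 2
print(first_codon(cds))      # NNG
print(len(cds) % 3)          # 0, the N's keep the sequence in frame
```

Here the N's pad out the first codon, so the length stays a multiple of 3 even though only one base of that codon is supported by evidence.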
Did you know that Ensembl has a helpdesk for this kind of question?
helpdesk [at] ensembl.org
I will investigate what is going on here and get back to you via the Ensembl Helpdesk. I may also post the answer here at a later stage.