Whenever I annotate microarray probes or RNASeq reads and want to have information at the gene level, I deal with the following problem: In order to have a "clean" annotation, I don't want to consider any reads/probes that map to transcripts of more than one gene, and for the sake of "clean" statistics I don't want to consider any probes/values more than once in the analysis. To achieve this, it is essential to work with a non-redundant database. I usually use RefSeq, because this is the database that is most familiar to me.
In RefSeq, however, I found many exons (about 1500, downloaded from the UCSC refGene track) that are shared by transcripts with different gene symbols. About 500 "genes" seem to be affected by this kind of overlap.
Here are two extreme examples:
Duxbl1, Duxbl2 and Duxbl3 share all their exon junctions. Their ORFs (NP_001171009.1, NP_001171010.1, NP_899245.1) are identical.
Il11ra2 shares all its exon junctions with Gm2002 and Gm13305. Their ORFs (NP_001094066.1, NP_034680.3, NP_001092818.1) are identical.
Some of the genes that I found differ in their UTRs due to alternative transcription start sites or poly-A sites. Some of them also differ in their ORFs due to alternative splicing events. On the other hand, RefSeq contains many transcripts that have the same gene symbol but differ in their ORFs and UTRs (e.g. Nfkbid).
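For anyone who wants to reproduce this kind of count, here is a minimal Python sketch. It assumes the tab-separated layout of the UCSC refGene table (chromosome, strand, exon starts, exon ends and the name2 gene symbol in columns 3, 4, 10, 11 and 13). The function name shared_exons and the toy rows are made up for illustration, and this is not necessarily how the numbers above were obtained.

```python
from collections import defaultdict


def shared_exons(lines):
    """Return exons that are used by transcripts with different gene symbols.

    Each line is assumed to be one tab-separated row in refGene layout.
    The result maps (chrom, strand, start, end) to the set of gene symbols.
    """
    exon_to_genes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom, strand = fields[2], fields[3]
        # exonStarts/exonEnds are comma-separated and end with a trailing comma
        starts = [int(x) for x in fields[9].split(",") if x]
        ends = [int(x) for x in fields[10].split(",") if x]
        symbol = fields[12]
        for start, end in zip(starts, ends):
            exon_to_genes[(chrom, strand, start, end)].add(symbol)
    # keep only exons that belong to more than one gene symbol
    return {exon: genes for exon, genes in exon_to_genes.items() if len(genes) > 1}


def toy_row(name, symbol, starts, ends):
    """Build a hypothetical refGene-style row (not real annotation data)."""
    return "\t".join([
        "0", name, "chr1", "+", "100", "500", "150", "450",
        str(len(starts)),
        "".join(f"{s}," for s in starts),
        "".join(f"{e}," for e in ends),
        "0", symbol, "cmpl", "cmpl", "0,1,",
    ])


toy_rows = [
    toy_row("NM_A1", "GeneA", [100, 300], [200, 500]),
    toy_row("NM_A2", "GeneA", [100, 300], [200, 500]),  # same symbol, not counted
    toy_row("NM_B1", "GeneB", [100, 400], [200, 500]),  # shares exon 100-200
]

for exon, genes in shared_exons(toy_rows).items():
    print(exon, sorted(genes))
```

Exons are compared by exact coordinates here, so partially overlapping exons are not reported; that would need an interval comparison instead.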
In Ensembl, only about 900 exons (downloaded from the UCSC ensGene track) are shared by transcripts with different ENSG IDs, which still affects about 500 genes. Ensembl states:
"An Ensembl gene (with a unique ENSG... ID) includes any spliced transcripts (ENST...) with overlapping coding sequence. (...) Transcript clusters with no overlapping coding sequence are annotated as separate genes."
This sounds very reasonable, but here, too, I found inconsistencies: Palm2 (ENSMUSG00000090053) and Gm20459 (ENSMUSG00000089945) share coding exons. And what happens when a gene has both coding and non-coding isoforms?
My questions are:
(1) Would you recommend Ensembl over RefSeq to avoid/minimize problems of redundancy? (2) What is the most accepted definition of a gene when it comes to the question of information at the "gene level" for RNASeq and microarray data?
Thank you for your ideas and advice!
Johanna
Thank you, the helpdesk is a good idea. I will post the answer as soon as I have an explanation.
(Also sent from Ensembl helpdesk)
We haven't gone through every one of your 500, but we had a look at a few examples and found the same pattern, which should explain what's going on.
What's happening here is read-through transcripts. Let's look at the first example on your list: chr10:127759937-127760250(+) ENSMUST00000128247 ENSMUST00000073639 ENSMUSG00000056148 ENSMUSG00000089789
Here's the region of these genes: http://www.ensembl.org/Mus_musculus/Location/View?r=10%3A127759787-127792694
Here you can see what look like two distinct genes, Rdh1 (ENSMUSG00000089789) and Rdh9 (ENSMUSG00000056148). Spanning both of those genes is the transcript ENSMUST00000128247. This appears to share a 5' end with Rdh1 and the 3' end with Rdh9.
This transcript has been manually annotated by the Havana project based on biological evidence of its existence. The evidence can be seen here: http://www.ensembl.org/Mus_musculus/Transcript/SupportingEvidence?db=core;g=ENSMUSG00000056148;r=10:127759787-127792694;t=ENSMUST00000128247
However, its existence does not mean that Rdh1 and Rdh9 are the same gene. Biologically, we still believe that they are two distinct genes; transcription just occasionally spans both of them, which is known as readthrough transcription. For this reason we manually split these into two separate genes (although possibly we need three genes: the 5' gene, the 3' gene and the readthrough).
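To screen a list for this pattern, one rough heuristic is to flag any transcript that overlaps transcripts of two different genes which do not overlap each other. The Python sketch below illustrates that idea only; it is not the rule Havana or Ensembl applies, and the Tx record, the function names and the toy coordinates are all hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Tx(NamedTuple):
    """Hypothetical transcript record with half-open coordinates."""
    tx_id: str
    gene_id: str
    chrom: str
    strand: str
    start: int
    end: int


def overlaps(a, b):
    """True if two transcripts overlap on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def readthrough_candidates(transcripts):
    """Flag transcripts that bridge two otherwise separate genes."""
    candidates = set()
    for t in transcripts:
        neighbours = [o for o in transcripts
                      if o.tx_id != t.tx_id and overlaps(t, o)]
        for a, b in combinations(neighbours, 2):
            # two different genes that are only connected through t
            if a.gene_id != b.gene_id and not overlaps(a, b):
                candidates.add(t.tx_id)
                break
    return candidates


# Toy data: tx_span shares its 5' end with geneUp and its 3' end with geneDown
toy_transcripts = [
    Tx("tx_up", "geneUp", "chr10", "+", 100, 200),
    Tx("tx_down", "geneDown", "chr10", "+", 300, 400),
    Tx("tx_span", "geneDown", "chr10", "+", 100, 400),
]

print(readthrough_candidates(toy_transcripts))
```

Flagged transcripts are only candidates and would still need to be checked against the annotation, since overlapping genes on the same strand can produce the same pattern for other reasons.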
I agree that it could be clearer in our documentation that this is the case, and we will look into making it more explicit.
Here's the documentation from Havana, who do the manual annotation, on defining genes. There's a section on readthroughs on page 37. http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/assets/guidelines.pdf
In terms of non-coding overlaps, what we mean is that if just a little bit of the UTRs overlap, we won't consider them to be the same gene. For non-coding transcripts, what we're looking at is full exon sharing, rather than just sharing a small part. Our completely non-coding genes are manually curated (see page 18 of the Havana document above), which means that there's a bit of common sense that goes into it.
In terms of your microarray analysis, one way might be to compare to transcripts rather than genes. You can then trace each transcript back to its gene, which will differentiate things a bit more.
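As a rough illustration of that transcript-first approach, the following Python sketch maps probes to transcripts, traces the transcripts back to genes, and keeps only probes that end up in exactly one gene. The function name, both lookup tables and all IDs are hypothetical placeholders; in practice they would come from your own probe alignment and annotation files.

```python
def unambiguous_probes(probe_to_transcripts, transcript_to_gene):
    """Return {probe: gene} for probes whose transcripts all share one gene."""
    assigned = {}
    for probe, transcripts in probe_to_transcripts.items():
        genes = {transcript_to_gene[tx] for tx in transcripts}
        if len(genes) == 1:
            assigned[probe] = next(iter(genes))
    return assigned


# Toy lookup tables (made-up IDs, not real annotation)
toy_transcript_to_gene = {
    "tx1": "geneA",
    "tx2": "geneA",
    "tx3": "geneB",
}
toy_probe_to_transcripts = {
    "probe1": ["tx1", "tx2"],  # two isoforms of the same gene: kept
    "probe2": ["tx2", "tx3"],  # hits two genes: dropped
    "probe3": ["tx3"],         # single transcript: kept
}

print(unambiguous_probes(toy_probe_to_transcripts, toy_transcript_to_gene))
```

Each surviving probe is assigned to a single gene, so no value is counted more than once at the gene level.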
Get back to me if you have any more questions about this.
Thank you for this quick and helpful reply to my questions! I was not aware of the read-through problem, and it seems that quite a few examples on my list can be explained that way. Another category seems to be mRNAs that overlap with targets of nonsense-mediated decay (NMD). For example, Lrch4 has isoforms that are classified as protein-coding, e.g. ENSMUST00000031734, and isoforms that are NMD targets, e.g. ENSMUST00000177477. Both have the same ENSMUSG... ID (ENSMUSG00000093445). In addition, Lrch4 shares most of its protein-coding exons with Gm20605, which is also an NMD target but has a different ENSMUSG... ID. In these cases, I don't see the rule yet.
For everybody else who might read this, I still have to clarify: not all of the roughly 900 exons that I mentioned in my initial post conflict with Ensembl's statement, which appears to be somewhat simplified. Only about 400 exons are shared between coding regions of different genes in the mouse genome. These are the cases Emily is talking about in her answer.
I've sent your list on to our genebuild team, so hopefully we can get some answers.