Still confused about exons versus CDS
10.4 years ago
lilla.davim ▴ 160

Hello,

I thought I had understood the difference between the two terms, but I am afraid I still need a clear explanation. Is the following correct?

  • Exon: A sequence which remains present in a mature RNA.
  • CDS: A sequence which remains present in a mature RNA and codes for a protein (i.e. gets translated).

Based on these definitions, I would expect every CDS to be contained within exons. Now, on the UCSC "Get Genomic Sequence Near Gene" page, I have the following two mutually exclusive display options:

  1. Exons in upper case, everything else in lower case
  2. CDS in upper case, UTR in lower case

I would therefore expect that when I select option 2, there are fewer nucleotides in upper case than with option 1.

But if I compare the results for the 2 options on the same sequence, I observe the following:

  • A) Entire sequences in upper case in option 1 become lower case in option 2
  • B) Entire sequences in lower case in option 1 become upper case in option 2

I can understand A (the parts of exons that are UTR, and thus non-coding, become lower case in option 2), but I don't understand at all why B also happens.

Any clue?

Thanks for your help.

exon cds • 39k views

Hello Adrian,

Ok so your definitions correspond to mine, i.e., CDS are included in exons.

You could be looking at something with very small introns and large UTRs, in which case option 2 will have more lower case than option 1.

I guess option 2 should always have more (or equal) lower case bases than option 1.

Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?

But in that case, why does B happen at all? Why isn't everything simply lower case with option 2?


B) should never happen. Can you give an example gene? That would seem to be an error in the annotation (though the UCSC annotations aren't that great, use Ensembl).


If you tell it to include introns and select "CDS in upper case, UTR in lower case", then the case of the introns will probably be whatever it is in the genome to begin with (upper case in the example you gave). There's no option for "CDS in upper case, everything else in lower case" as there is for exons.

10.4 years ago
Adrian Pelin ★ 2.6k

Exons = gene - introns

CDS = gene - introns - UTRs

therefore also:

CDS = Exons - UTRs

Hope this helps clarify things. Whether your expectation holds depends on the organism you are looking at. You could be looking at something with very small introns and large UTRs, in which case option 2 will have more lower case than option 1.

Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?
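
To make the subtraction above concrete, here is a minimal sketch in plain Python. The coordinates are made up, and the half-open `(start, end)` convention and the function name `split_exons` are my own assumptions for illustration, not anything UCSC provides.

```python
def split_exons(exons, cds_start, cds_end):
    """Split exon intervals into their CDS and UTR parts.

    exons: sorted, non-overlapping half-open (start, end) intervals.
    cds_start, cds_end: half-open span from the start codon to the stop codon.
    A non-coding transcript can be modelled with cds_start == cds_end.
    """
    cds, utr = [], []
    for start, end in exons:
        # Clip the exon to the coding span.
        c_start, c_end = max(start, cds_start), min(end, cds_end)
        if c_start < c_end:
            cds.append((c_start, c_end))
            if start < c_start:      # exon part before the coding span
                utr.append((start, c_start))
            if c_end < end:          # exon part after the coding span
                utr.append((c_end, end))
        else:
            # Exon lies entirely outside the coding span: all UTR.
            utr.append((start, end))
    return cds, utr


exons = [(0, 10), (20, 30), (40, 50)]   # hypothetical coordinates
cds, utr = split_exons(exons, 5, 45)
print(cds)  # [(5, 10), (20, 30), (40, 45)]
print(utr)  # [(0, 5), (45, 50)]
```

Every CDS interval ends up inside an exon, which is the "CDS = Exons - UTRs" relation. With `cds_start == cds_end` the CDS list is empty and all exons are reported as UTR.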

10.4 years ago
Bert Overduin ★ 3.7k

Hello Lilla,

Your understanding of exons and CDS is correct.

It's just that the UCSC formatting options are confusing. Some experimenting of my own suggests that:

"Exons in upper case, everything else in lower case" means:

  • UTRs in upper case
  • CDS in upper case
  • introns in lower case

"CDS in upper case, UTR in lower case" means:

  • UTRs in lower case
  • CDS in upper case
  • introns in upper case (!!!!)

Hope this explains it.
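
The two behaviours listed above can be mimicked on a toy sequence. This is only a sketch of what was observed in this thread, not UCSC's actual code. The function `render`, the mode names and the coordinates are all hypothetical.

```python
def render(seq, exons, cds_start, cds_end, mode):
    """Apply one of the two case conventions described above.

    exons: half-open (start, end) intervals; cds_start/cds_end: coding span.
    mode "exons": exons upper case, everything else lower case.
    mode "cds":   CDS upper case, UTR lower case, introns left in upper case
                  (the observed behaviour, assumed here rather than documented).
    """
    out = []
    for i, base in enumerate(seq):
        in_exon = any(s <= i < e for s, e in exons)
        in_cds = in_exon and cds_start <= i < cds_end
        if mode == "exons":
            upper = in_exon
        elif mode == "cds":
            upper = in_cds or not in_exon
        else:
            raise ValueError(f"unknown mode: {mode}")
        out.append(base.upper() if upper else base.lower())
    return "".join(out)


seq = "ACGTACGTACGT"
exons = [(0, 4), (8, 12)]     # one intron at positions 4..7
print(render(seq, exons, 2, 10, "exons"))  # ACGTacgtACGT
print(render(seq, exons, 2, 10, "cds"))    # acGTACGTACgt
```

Comparing the two outputs reproduces both observations from the question: the UTR bases at the ends go from upper to lower case (A), and the intron in the middle goes from lower to upper case (B).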


Hello,

Thanks a lot for your reply, which makes things much clearer now. The only thing that remains completely unclear is UCSC's rationale for implementing things this way! Btw, is there any other method (using UCSC, Ensembl or anything else) to generate, for a given sequence in an assembly (GRCh37 or GRCh38 is ok for me), a version with CDS in upper case and everything else in lower case?

Thanks!


That is a question for the UCSC genome browser team, I'm afraid.

As for an easy way to get CDSs in upper case and the rest of the sequence in lower case, I am not aware of any. You should also keep in mind that doing this for a transcript sequence and doing this for a genomic sequence can give different results. While a transcript has only one CDS (or none, if it is non-coding), a genomic sequence can, because of alternative splicing, contain several CDSs that (partially) overlap each other.
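
If you do want a single "coding anywhere" mask for a genomic region, one option is to take the union of the CDS intervals of all transcripts. A minimal sketch, assuming half-open `(start, end)` coordinates and made-up transcripts (`merge_intervals` is a hypothetical helper name):

```python
def merge_intervals(intervals):
    """Return the union of half-open (start, end) intervals, sorted by start."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# CDS segments of two hypothetical transcripts of the same gene
tx1_cds = [(5, 10), (20, 30)]
tx2_cds = [(8, 12), (20, 25), (40, 45)]
print(merge_intervals(tx1_cds + tx2_cds))  # [(5, 12), (20, 30), (40, 45)]
```

Note that a base that is coding in one transcript may be UTR or intronic in another, so the union is a choice you make, not the only sensible answer.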


You could always just use R or biopython/bioperl. They'd take longer to get what you want, but then you would know that the output is exactly what's desired.
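
For instance, a minimal sketch in plain Python (no Biopython needed for the masking step itself). It assumes you already have the genomic sequence as a string and the CDS coordinates as half-open `(start, end)` offsets into that string. The function name `mask_to_cds` and the example values are made up.

```python
def mask_to_cds(seq, cds_intervals):
    """Return seq with CDS bases in upper case and everything else in lower case.

    cds_intervals: half-open (start, end) offsets into seq (assumed 0-based).
    """
    chars = list(seq.lower())
    for start, end in cds_intervals:
        # Overwrite the coding stretch with its upper-case version.
        chars[start:end] = seq[start:end].upper()
    return "".join(chars)


genomic = "ACGTACGTACGT"                            # toy sequence
print(mask_to_cds(genomic, [(2, 4), (8, 10)]))      # acGTacgtACgt
```

The coordinates would have to come from an annotation (for example a GTF/GFF file or a table download), converted to offsets relative to the start of your sequence.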
