Question

Still confused about exons versus CDS

11

10.9 years ago

lilla.davim ▴ 180

Hello,

I thought I had understood the difference between the 2 terms but I am afraid I still need a clear explanation. Is the following correct?

Exon: A sequence which remains present in a mature RNA.
CDS: A sequence which remains present in a mature RNA and codes for a protein (i.e. gets translated).

Based on these definitions, I would expect every CDS to be contained within exons. On the UCSC "Get Genomic Sequence Near Gene" page, I can choose between the following two (mutually exclusive) display options:

1. Exons in upper case, everything else in lower case
2. CDS in upper case, UTR in lower case

I would therefore expect fewer nucleotides in upper case with option 2 than with option 1.

But if I compare the results for the 2 options on the same sequence, I observe the following:

A) Entire sequences in upper case in option 1 become lower case in option 2
B) Entire sequences in lower case in option 1 become upper case in option 2

I can understand A (the parts of exons that are UTR, and thus non-coding, become lower case in option 2), but I don't understand at all why B also happens.

Any clue?

Thanks for your help.

exon cds • 42k views

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by lilla.davim ▴ 180

0

Hello Adrian,

Ok so your definitions correspond to mine, i.e., CDS are included in exons.

"You could be looking at something with very small Introns and large UTRs, in which case option 2 will have more lower case than option 1."

I guess option 2 should always have at least as many lower case bases as option 1.

"Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?"

But in that case, why does B happen, instead of everything simply being lower case with option 2?

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by lilla.davim ▴ 180

0

B) should never happen. Can you give an example gene? That would seem to be an error in the annotation (though the UCSC annotations aren't that great, use Ensembl).

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by Devon Ryan 105k

0

Here is an example in the 1st entry of the following fasta, which goes from lower to upper case. All other parameters are the same, only the display options are different:

With option 1 (hgSeq.casing=exon):

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=381409833_XLyyszThcNrKOH4dtiDua1kPTT1k&g=htcDnaNearGene&i=uc002wxs.3&c=chr20&l=30946146&r=31027122&o=knownGene&boolshad.hgSeq.promoter=0&hgSeq.promoterSize=1000&hgSeq.utrExon5=on&boolshad.hgSeq.utrExon5=0&hgSeq.cdsExon=on&boolshad.hgSeq.cdsExon=0&hgSeq.utrExon3=on&boolshad.hgSeq.utrExon3=0&hgSeq.intron=on&boolshad.hgSeq.intron=0&boolshad.hgSeq.downstream=0&hgSeq.downstreamSize=1000&hgSeq.granularity=feature&hgSeq.padding5=0&hgSeq.padding3=0&hgSeq.splitCDSUTR=on&boolshad.hgSeq.splitCDSUTR=0&hgSeq.casing=exon&boolshad.hgSeq.maskRepeats=0&hgSeq.repMasking=lower&submit=submit

With option 2 (hgSeq.casing=cds):

http://genome.ucsc.edu/cgi-bin/hgc?hgsid=381409833_XLyyszThcNrKOH4dtiDua1kPTT1k&g=htcDnaNearGene&i=uc002wxs.3&c=chr20&l=30946146&r=31027122&o=knownGene&boolshad.hgSeq.promoter=0&hgSeq.promoterSize=1000&hgSeq.utrExon5=on&boolshad.hgSeq.utrExon5=0&hgSeq.cdsExon=on&boolshad.hgSeq.cdsExon=0&hgSeq.utrExon3=on&boolshad.hgSeq.utrExon3=0&hgSeq.intron=on&boolshad.hgSeq.intron=0&boolshad.hgSeq.downstream=0&hgSeq.downstreamSize=1000&hgSeq.granularity=feature&hgSeq.padding5=0&hgSeq.padding3=0&hgSeq.splitCDSUTR=on&boolshad.hgSeq.splitCDSUTR=0&hgSeq.casing=cds&boolshad.hgSeq.maskRepeats=0&hgSeq.repMasking=lower&submit=submit

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by lilla.davim ▴ 180

0

If you tell it to include introns and select "CDS in upper case, UTR in lower case", then the case of the introns will probably be whatever it is in the genome to begin with (upper case in the example you gave). There's no option for "CDS in upper case, everything else in lower case" as there is for exons.

Written 10.9 years ago by Devon Ryan 105k

Ram · Answer 1 · 2014-06-28

Exons = gene - introns

CDS = gene - introns - UTRs

therefore also:

CDS = Exons - UTRs

Hope this helps clarify things. Whether your expectation holds depends on which organism you are looking at. You could be looking at something with very small introns and large UTRs, in which case option 2 will have more lower case than option 1.

Maybe the reason why you observe A and B is because your gene is not protein coding? ergo, no CDS?
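
The set relations in this answer can be sketched as interval arithmetic. The following is a minimal illustration, not UCSC's implementation. It assumes 0-based, half-open coordinates and a single transcript described by a list of exon intervals plus `cds_start`/`cds_end`, similar in spirit to the genePred-style tables behind UCSC gene tracks. The function names and the toy coordinates are made up for the example, and treating `cds_start == cds_end` as "non-coding" is an assumption of this sketch.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open: [start, end)


def introns(exons: List[Interval]) -> List[Interval]:
    """Introns are the gaps between consecutive exons (gene - exons)."""
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def split_exons(
    exons: List[Interval], cds_start: int, cds_end: int
) -> Tuple[List[Interval], List[Interval]]:
    """Split exons into coding parts (CDS) and untranslated parts (UTR).

    CDS = exons - UTRs: every CDS block lies inside an exon.
    """
    cds, utr = [], []
    for start, end in sorted(exons):
        c_start, c_end = max(start, cds_start), min(end, cds_end)
        if c_start < c_end:  # this exon overlaps the coding region
            cds.append((c_start, c_end))
            if start < c_start:
                utr.append((start, c_start))
            if c_end < end:
                utr.append((c_end, end))
        else:  # exon is entirely untranslated
            utr.append((start, end))
    return cds, utr


# Hypothetical three-exon transcript
exons = [(100, 200), (300, 400), (500, 650)]
cds, utr = split_exons(exons, cds_start=150, cds_end=560)
print(introns(exons))  # [(200, 300), (400, 500)]
print(cds)             # [(150, 200), (300, 400), (500, 560)]
print(utr)             # [(100, 150), (560, 650)]
```

With `cds_start == cds_end` every exon ends up in the untranslated list and the CDS list is empty, which corresponds to the "not protein coding, ergo no CDS" case mentioned above.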

Ram · Answer 2 · 2014-06-29

6

10.9 years ago

Bert Overduin ★ 3.7k

Hello Lilla,

Your understanding of exons and CDS is correct.

It's just that the UCSC formatting options are confusing. Some experimenting of my own suggests that:

"Exons in upper case, everything else in lower case" means:

UTRs in upper case
CDS in upper case
introns in lower case

"CDS in upper case, UTR in lower case" means:

UTRs in lower case
CDS in upper case
introns in upper case (!!!!)

Hope this explains.

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by Bert Overduin ★ 3.7k
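
The behaviour described in this answer can be reproduced with a toy model. This is a sketch based only on the observations in this thread, not on UCSC's source code. It assumes that the "exon" mode lower-cases everything and then upper-cases exons, while the "cds" mode only touches exonic bases and leaves introns in whatever case the input sequence already had. The function name `apply_casing`, the mode strings, and the toy sequence are made up for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open: [start, end)


def apply_casing(
    seq: str, exons: List[Interval], cds_start: int, cds_end: int, mode: str
) -> str:
    """Toy model of the two display options discussed in this thread."""
    chars = list(seq)
    if mode == "exon":
        # Exons in upper case, everything else in lower case
        chars = [c.lower() for c in chars]
        for start, end in exons:
            chars[start:end] = [c.upper() for c in chars[start:end]]
    elif mode == "cds":
        # CDS in upper case, UTR in lower case; introns are left untouched
        for start, end in exons:
            for i in range(start, end):
                is_coding = cds_start <= i < cds_end
                chars[i] = chars[i].upper() if is_coding else chars[i].lower()
    else:
        raise ValueError(f"unknown mode: {mode}")
    return "".join(chars)


# Hypothetical gene: two exons separated by one intron (positions 8..11),
# coding region from position 4 up to (not including) position 16.
genome = "AAAACCCCGGGGTTTTAAAA"  # upper case in the source, i.e. not soft-masked
exons = [(0, 8), (12, 20)]

option1 = apply_casing(genome, exons, 4, 16, mode="exon")
option2 = apply_casing(genome, exons, 4, 16, mode="cds")
print(option1)  # AAAACCCCggggTTTTAAAA
print(option2)  # aaaaCCCCGGGGTTTTaaaa
```

Comparing the two outputs shows both effects from the question: the UTRs go from upper to lower case (A), and the intron goes from lower to upper case (B), because the second mode never rewrites it.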

0

Hello,

Thanks a lot for your reply, which makes things much clearer now. The only thing that remains completely unclear is UCSC's rationale for implementing things this way! By the way, is there any other method (using UCSC, Ensembl or something else) to generate, for a given sequence in an assembly (GRCh37 or GRCh38 is fine for me), the CDS in upper case and everything else in lower case?

Thanks!

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by lilla.davim ▴ 180

0

That is a question for the UCSC genome browser team, I'm afraid.

As for an easy way to get CDSs in upper case and the rest of the sequence in lower case, I am not aware of any. You should also keep in mind that doing this for a transcript sequence and doing this for a genomic sequence can give different results. While a transcript has only one CDS (or none, if it is non-coding), a genomic sequence can, because of alternative splicing, contain several CDSs that (partially) overlap each other.

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by Bert Overduin ★ 3.7k

0

You could always just use R or biopython/bioperl. They'd take longer to get what you want, but then you would know that the output is exactly what's desired.

Updated 3.5 years ago by Ram 45k • written 10.9 years ago by Devon Ryan 105k
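
As a rough illustration of the do-it-yourself route suggested above, here is a plain-Python sketch (no Biopython needed) that puts CDS bases in upper case and everything else in lower case. It assumes you already have the genomic sequence as a string and the CDS blocks of each transcript as 0-based, half-open intervals relative to that string, for example derived from the CDS features of an annotation file. Because several transcripts can have overlapping CDSs, the sketch takes the union of all blocks, which is one possible choice and not the only one. All names and coordinates are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open: [start, end)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals into a sorted union."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cds_upper_rest_lower(seq: str, cds_per_transcript: List[List[Interval]]) -> str:
    """Upper-case every base that is coding in at least one transcript."""
    all_blocks = [block for blocks in cds_per_transcript for block in blocks]
    chars = list(seq.lower())
    for start, end in merge_intervals(all_blocks):
        chars[start:end] = seq[start:end].upper()
    return "".join(chars)


# Hypothetical locus with two transcripts whose CDS blocks partly overlap
genomic = "acgtacgtacgtacgtacgt"
transcript_a = [(2, 6), (12, 16)]
transcript_b = [(4, 8)]

print(merge_intervals(transcript_a + transcript_b))  # [(2, 8), (12, 16)]
print(cds_upper_rest_lower(genomic, [transcript_a, transcript_b]))
# acGTACGTacgtACGTacgt
```

Passing a single transcript instead of the full list gives the per-transcript view, which, as noted earlier in the thread, can differ from the genomic view.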