Question

Tair Seems To Show Incorrect Annotation For A Spliced Gene


10.9 years ago

Ritvik ▴ 30

Hi,

I can't understand why, for the single gene model ATMG00060.1, TAIR shows different CDS and cDNA sequences, even though both have the same length and the gene seemingly has no 5' UTR.

Corresponding links are as follows:

Gene ATMG00060.1: http://arabidopsis.org/servlets/TairObject?id=1000647816&type=gene

CDS : http://arabidopsis.org/servlets/TairObject?type=sequence&id=1002472305

cDNA : http://arabidopsis.org/servlets/TairObject?type=sequence&id=2002989388

Can anyone explain what's happening here?

Another very general question about splicing order:

Suppose my gene has two exons:

Exon 1's position is complement[21691:22086], as I have the DNA sequence of the opposite strand.

Exon 2's position is complement[20570:20717], as I have the DNA sequence of the opposite strand.

So which splicing order is correct :

Final spliced mRNA = Reverse complement of (Exon1 + Exon2) or Final spliced mRNA = Reverse complement of (Exon2 + Exon1)

Also, can anyone explain why, when alternative splice variants are filtered out, the one with the longest CDS is kept? Is there any relationship between CDS length and mRNA stability?

splicing mrna cds • 3.0k views

ADD COMMENT • link updated 10.9 years ago by Istvan Albert 101k • written 10.9 years ago by Ritvik ▴ 30

score 1 · Answer 1 · 2014-01-09

You should link to your gene rather than paste the sequences; it is too hard to tell whether they are right or not. Edit your post, remove the sequences, and add links to the gene.

In general one has to be cautious with these terms, as they are not always used properly, even by data sources. One can consult the Sequence Ontology for reference, where the definition of CDS is:

A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

The definition of mRNA is:

Messenger RNA is the intermediate molecule between DNA and protein. It includes UTR and coding sequences. It does not contain introns.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000234

score 1 · Answer 2 · 2014-01-10


10.9 years ago

Istvan Albert 101k

This is not about the correct splicing order but about what the terms CDS and mRNA actually mean.

As you note, the terminology can be greatly misleading and confusing, and most likely millions of research dollars go to waste because of the errors it causes.

Bioinformatics representations that work from coordinates will usually produce output that matches the forward strand: for example, the start coordinate is always the smaller number, even though the actual start in the biological sense may be the higher coordinate. Representations that are sequence oriented (like mRNA) obey the correct directionality.

In this case it appears that the TAIR system will produce CDS in the coordinate representation whereas the mRNA represents the actual product.
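To make the coordinate-versus-sequence distinction concrete, here is a minimal Python sketch of splicing a transcript from forward-strand coordinates. It is an illustration only, not TAIR's actual procedure. The function names and the toy genome are made up for the example, and the coordinates are assumed to be 1-based and inclusive, as in a GenBank flat file. For a gene on the reverse strand, the exons are joined in ascending forward-strand order and the joined sequence is reverse complemented once. This gives the same result as reverse complementing each exon separately and writing the higher-coordinate exon first.

```python
# Sketch: splice a transcript from forward-strand exon coordinates.
# Assumption: coordinates are 1-based and inclusive (GenBank flat-file style).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def splice(genome, exons, strand):
    """Join exons given as (start, end) pairs on the forward strand.

    Exons may be listed in any order; they are sorted by coordinate first.
    For strand "-" the joined forward sequence is reverse complemented,
    so the exon with the highest coordinates ends up at the 5' end.
    """
    joined = "".join(genome[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


# Toy forward-strand sequence (hypothetical, for illustration only).
toy_genome = "AAGGAAAACATAA"
# "Exon 1" = complement(9..11), "Exon 2" = complement(3..4)
mrna = splice(toy_genome, [(9, 11), (3, 4)], "-")
print(mrna)
```

With the exons from the question, the same call would take `[(21691, 22086), (20570, 20717)]` and strand `"-"`, provided those numbers really are 1-based inclusive positions on the forward strand.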

In general I try to avoid working with CDS as it is almost never fully clear what someone means by that.

ADD COMMENT • link 10.9 years ago by Istvan Albert 101k


Thanks again for replying! If mRNAs are deemed to be more accurately represented, then what is the best method to extract CDS information for a gene at the genomic level?

Actually, I was trying to extract CDS and mRNA information from a chromosome GenBank file, but for a little under 5% of the genes the sequences didn't match, like the one in this question. Is 5% an acceptable error rate, or am I doing something fundamentally wrong here?

ADD REPLY • link 10.9 years ago by Ritvik ▴ 30


With biology we always have to be careful with the terminology, so this all comes down to what the word CDS actually means. The problem is usually (as above) that a site like TAIR gives you the CDS but does not tell you what CDS means in their interpretation.

Then, as is almost always the case, there are clearly instances where the data do not seem to match. These could be errors, or some type of conflicting information (that is not shown) may have forced a decision that ends up diverging.

Often it is easier to operate on coordinates rather than sequences, since then you can better see what each file is supposed to represent. So I would recommend finding the coordinates and then using either bedtools getfasta (if you have a BED12 file) or the gffread program (if you have a GFF file) to extract the sequences that you need.
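As a rough illustration of what coordinate-based extraction involves, the Python sketch below pulls a spliced sequence out of a single BED12 record. It is not the bedtools implementation, only an approximation of what one would expect from `bedtools getfasta` when run with `-split` (join the blocks) and `-s` (honor the strand). The function names, the `chrM` toy sequence and the `tx1` record are all hypothetical.

```python
# Sketch: extract a spliced, strand-aware sequence from one BED12 record.
# BED coordinates are 0-based and half-open; blockStarts are relative to chromStart.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def bed12_blocks(line):
    """Parse a tab-separated BED12 line into (chrom, strand, [(start, end), ...])."""
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = fields[0], int(fields[1]), fields[5]
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    blocks = [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]
    return chrom, strand, blocks


def spliced_sequence(line, genome):
    """Join the blocks of a BED12 record; reverse complement if on the '-' strand."""
    chrom, strand, blocks = bed12_blocks(line)
    seq = "".join(genome[chrom][start:end] for start, end in blocks)
    return reverse_complement(seq) if strand == "-" else seq


# Hypothetical toy data: a two-exon transcript on the reverse strand.
genome = {"chrM": "AAGGAAAACATAA"}
record = "chrM\t2\t11\ttx1\t0\t-\t2\t11\t0\t2\t2,3,\t0,6,"
print(spliced_sequence(record, genome))
```

Working from a record like this makes it explicit which intervals were joined and on which strand, which is much harder to tell from a bare sequence labelled "CDS".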

ADD REPLY • link 10.9 years ago by Istvan Albert 101k


OK, I will try what you have suggested. Once again, thanks for your help!

ADD REPLY • link 10.9 years ago by Ritvik ▴ 30


Hello,

I apologise for reviving a somewhat old topic.

If I use gffread for genomic features labelled as existing on the negative strand, will gffread find the reverse complement automatically when extracting the sequence from the fasta file, or will I have to implement an additional step to get that?

ADD REPLY • link 8.5 years ago by Thomas ▴ 160