I'm interested in identifying potential proteins that could map to novel splice isoforms.
I have run Cufflinks and have a list of high-confidence isoforms that might be novel. Now I want to determine whether any of them could code for proteins. I have written code that outputs a polypeptide sequence based on the exons that Cufflinks assigned to a given transcript. I'm pretty lost at this point because I don't have a clear understanding of how to construct my reading frames. I'll explain where I am so far in the hope that someone can tell me where I've made incorrect assumptions. Thanks.
Here's what the code does:
1-> It gets a list of potentially novel isoforms from the Cuffcompare .tmap file:
SLMO2-ATP5E NR_037929 j CUFF.72292 CUFF.72292.1 100 637.607724 628.443874 646.771574 23542.49981
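A minimal sketch of step 1, not my actual code. It assumes the .tmap columns follow the order of the example line above (reference gene, reference transcript, class code, Cufflinks gene ID, Cufflinks transcript ID, ...) and that class code "j" is what marks a potentially novel isoform. The function name and the `class_codes` parameter are made up for illustration.

```python
def novel_isoform_ids(tmap_lines, class_codes=("j",)):
    """Return Cufflinks transcript IDs whose class code is in class_codes.

    Assumes whitespace-separated columns with the class code in column 3
    and the Cufflinks transcript ID in column 5, as in the example line.
    """
    ids = []
    for line in tmap_lines:
        fields = line.split()
        # Skip blank/short lines and the header row, if present.
        if len(fields) < 5 or fields[0] == "ref_gene_id":
            continue
        if fields[2] in class_codes:
            ids.append(fields[4])
    return ids
```

Called on the example line, this would return `CUFF.72292.1` as the only ID.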
2-> It gets all exons for CUFF.72292.1 from the Cuffcompare combined file:
chr20 Cufflinks exon 57601521 57601524 . - . gene_id "XLOC_046344"; transcript_id "TCONS_00063383"; exon_number "1"; gene_name "SLMO2-ATP5E"; oId "CUFF.72292.1"; nearest_ref "NR_037929"; class_code "j"; tss_id "TSS52060";
chr20 Cufflinks exon 57603862 57603896 . - . gene_id "XLOC_046344"; transcript_id "TCONS_00063383"; exon_number "2"; gene_name "SLMO2-ATP5E"; oId "CUFF.72292.1"; nearest_ref "NR_037929"; class_code "j"; tss_id "TSS52060";
chr20 Cufflinks exon 57605358 57605484 . - . gene_id "XLOC_046344"; transcript_id "TCONS_00063383"; exon_number "3"; gene_name "SLMO2-ATP5E"; oId "CUFF.72292.1"; nearest_ref "NR_037929"; class_code "j"; tss_id "TSS52060";
chr20 Cufflinks exon 57607275 57607422 . - . gene_id "XLOC_046344"; transcript_id "TCONS_00063383"; exon_number "4"; gene_name "SLMO2-ATP5E"; oId "CUFF.72292.1"; nearest_ref "NR_037929"; class_code "j"; tss_id "TSS52060";
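Step 2 could be sketched roughly as follows. This is an illustration rather than my actual code: it assumes the first eight GTF columns contain no embedded whitespace, that the `oId` attribute carries the Cufflinks transcript ID as in the lines above, and that coordinates are 1-based and inclusive (the GTF convention). The function name is made up.

```python
import re

def exons_for_transcript(gtf_lines, oid):
    """Return (chrom, start, end, strand) tuples for exon records whose
    oId attribute equals `oid`, ordered 5'->3' along the transcript."""
    exons = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split off the first 8 columns; the 9th holds the attributes.
        fields = line.rstrip("\n").split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if attrs.get("oId") == oid:
            exons.append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    # On the minus strand the transcript's 5' end is the highest coordinate.
    minus = bool(exons) and exons[0][3] == "-"
    exons.sort(key=lambda e: e[1], reverse=minus)
    return exons
```

For the four exons listed above, the spliced length is 4 + 35 + 127 + 148 = 314 nt, which is consistent with peptides of about 104 residues per frame.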
Here's where I'm confused:
3-> Based on the strand, it grabs each exon's DNA sequence from the chromosome FASTA file, concatenates them, and translates the result into three peptides (one for each frame):
(-)Frame_0:
[CGAEKAKTPD*KDADLAGRLGCNGRRTAKPGCSRRKRCRTTG*PLSDLSRCRL*GSRHVFVTLYVTSVLSFVYDSSEDRRCIFNTFISSLLDGTDFELYDVKVP]
(-)Frame_1:
[AGRRRRRHQTRRTPTWRADSAVTAAEPLSRAARGESDVVPPDDLCPT*VDVGYEGLDTFSSLST*LLS*VSFTTLLKTVVAFLTLSFLPY*MGLISNFTM*RF]
(-)Frame_2:
[RGGEGEDTRLEGRRLGGPTRL*RPQNR*AGLLEAKAMSYHRMTSVRPESM*AMRV*TRFRHSLRDFCLKFRLRLF*RPSLHF*HFHFFLIRWD*FRTLRCKGS]
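For reference, here is a minimal sketch of what step 3 amounts to, under some assumptions: the chromosome is already loaded as a plain string (`chrom_seq`, a hypothetical name), exon coordinates are 1-based inclusive, and the standard genetic code applies. Exons are joined in genomic order and the whole spliced sequence is reverse-complemented for a minus-strand transcript. Codons are then read from the joined sequence, so in this sketch a codon can span an exon junction.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def spliced_sequence(chrom_seq, exons, strand):
    """Join exons given as (start, end) pairs, 1-based inclusive, and
    return the transcript in its own 5'->3' orientation."""
    joined = "".join(chrom_seq[s - 1:e] for s, e in sorted(exons)).upper()
    return reverse_complement(joined) if strand == "-" else joined

def translate(seq, frame=0):
    """Translate from offset `frame`; codons with ambiguous bases become X."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)

def three_frames(seq):
    return [translate(seq, f) for f in range(3)]
```

The same three functions handle plus-strand transcripts unchanged, since only `spliced_sequence` looks at the strand.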
Reasons for confusion:
- I am unsure whether it was correct to build a reading frame from the entire sequence (connecting exons head to tail), as opposed to from each exon individually (before concatenation).
- I am unsure whether a transcript can change reading frames from exon to exon during splicing, as this would very much complicate things.
- I'm not certain whether a reading frame is always contained entirely within the AG-GU boundaries. In other words, is it possible for the G on either side to be included in the frame?
- For protein inference, can a methionine occur in addition to the one at the start site, or is this invalid? For instance:
MKPGCSRRKRCRTTG* (valid?)
MKPGCSRMKRCRTTG* (invalid?)
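To make the last question concrete, here is a rough sketch of one way candidate ORFs could be pulled out of a translated frame. It assumes a common convention that I have not confirmed is the right one here: an ORF starts at the first M after a stop (or after the start of the sequence) and runs to the next stop, and any later M inside it is treated as an ordinary residue. The function name and the `min_len` parameter are made up.

```python
def find_orfs(peptide, min_len=1):
    """Return candidate ORFs from a translated frame ('*' marks a stop).

    Each ORF starts at the first 'M' of a stop-delimited segment and ends
    with '*'. The trailing segment has no stop after it and is skipped.
    """
    orfs = []
    segments = peptide.split("*")
    for segment in segments[:-1]:
        start = segment.find("M")
        # min_len counts residues from the first M, excluding the stop.
        if start != -1 and len(segment) - start >= min_len:
            orfs.append(segment[start:] + "*")
    return orfs
```

Under that convention, both example peptides above would be reported as single ORFs, the second one keeping its internal M.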
Thanks!
-Jeremy
Thanks Devon!