Question

Dual reading frames

11.0 years ago

milady81 ▴ 70

Dear Scientists,

I have a question regarding dual reading frames.

I assembled RNA-seq data from a HiSeq run. I got contigs coding for the same proteins, but in different frames.

Example: ProtA is encoded in comp34909 (Length=1448) FRAME:+2 and comp26759 (Length=1177) FRAME:-1

The protein-coding regions are the same for both sequences.

Does this mean I have dual reading frames? How can such an example be explained? What further analysis should I do?

I have a lot of similar sequences.

Thank you very much for your answer:)

Dorota

RNA-Seq Assembly • 2.5k views

Updated 3.6 years ago by Ram 45k • written 11.0 years ago by milady81 ▴ 70

Dear Michael,

Thank you very, very much for such a nice reply:) I am still learning and your information was very useful for me.

First: yes, my reads are non-strand-specific, and for the assembly I used Trinity (before that I also tried SOAPTrans, but it was inefficient with my data).

After the assembly I ran tblastn with seed sequences as queries for the proteins I am interested in (well known in lineages other than the organism I analyzed). Next I took the contigs encoding those proteins, translated them with Expasy (it shows all six frames, +1 to +3 and -1 to -3), chose the best alignment, and re-ran the protein sequences against the transcriptome. Then I saw that for each protein I have two matches (in two different frames).

You wrote:

If the contigs are not from the same transcript, this could be observed as well. Check the surrounding sequences of the CDS for similarity, it might be a duplication, misassembly, etc. Check the raw coverage and mismatches for both contigs.

Did you mean, for example, checking in IGV the sequences before and after the region I identified? I really don't know how to check this, or what program to use to see how the scaffold is arranged.

Once again, thank you Michael. Your answer was very useful and explained simply, the way I like:) and understandable to me:)

Cheers
DB

Updated 5.4 years ago by Ram 45k • written 11.0 years ago by milady81 ▴ 70

Hi DB, what I meant with that somewhat convoluted sentence was: "Try to find out if the contigs are from the same transcript."

Run Blastn of both contigs against each other. Is there sequence similarity in the sequences surrounding the coding sequence as well, or is the similarity only within the coding sequence?
Align your original reads to the assembled contigs and view the result in IGV. Are there reads bridging (by sequence or by matching pairs) the boundary between the predicted coding sequence and the flanking sequence?
If you want to be sure, run a PCR (DNA) and an RT-PCR using primers for each transcript.
Is it worth the effort?
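
As a rough illustration of the first check (similarity outside the coding sequence), here is a minimal Python sketch. It is not Blastn: it swaps in a crude shared k-mer comparison of the flanking sequences, which is strand-independent so it also works when one contig is the reverse complement of the other. The function names, the 0-based half-open CDS coordinates, and `k=8` are all assumptions made up for this example; for real data an actual alignment is the better tool.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k=8):
    """Strand-independent k-mers: keep each k-mer or its reverse
    complement, whichever sorts first."""
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def flanks(contig, cds_start, cds_end):
    """Sequence outside the CDS (hypothetical 0-based, half-open coords)."""
    return contig[:cds_start], contig[cds_end:]


def flank_similarity(contig_a, cds_a, contig_b, cds_b, k=8):
    """Jaccard index of canonical k-mers taken only from the flanks."""
    ka = set().union(*(canonical_kmers(f, k) for f in flanks(contig_a, *cds_a)))
    kb = set().union(*(canonical_kmers(f, k) for f in flanks(contig_b, *cds_b)))
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


# Toy data (invented): one transcript, seen once as-is and once reverse-complemented.
cds = "ATGGCTAAACGTTGGTAA"
up, down = "ACGTTGCAAGGCTTAACCGT", "TTGACCGTAGGCATCAAGTC"
contig_a = up + cds + down
cds_a = (len(up), len(up) + len(cds))

contig_b = revcomp(contig_a)
cds_b = (len(contig_a) - cds_a[1], len(contig_a) - cds_a[0])

# Same CDS, but unrelated flanks (e.g. a duplication or a misassembly).
contig_c = "G" * 20 + cds + "AT" * 10
cds_c = (20, 20 + len(cds))

print(flank_similarity(contig_a, cds_a, contig_b, cds_b))
print(flank_similarity(contig_a, cds_a, contig_c, cds_c))
```

A high value for the first pair suggests the two contigs are the same transcript picked up on opposite strands; a value near zero for the second pair means only the coding sequence is shared, which is the case worth looking at more closely.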

Updated 5.3 years ago by Ram 45k • written 11.0 years ago by Michael 55k

Ram · Answer 1 · 2014-05-07

Hi,

I think that is to be expected. If I understand correctly, you have non-strand-specific RNA-seq reads (please verify if this is correct). You assembled your reads into contigs (using which assembler?) and ran some sort of gene or ORF prediction (which?) on the assembled contigs.

Now you have found two gene predictions on different contigs with identical coding sequence, but on opposite strand and different reading frame. That is totally fine, because strand and frame are mostly irrelevant for your case. Why:

Strand is mostly irrelevant, because your protocol was (probably) not strand-specific, so it is entirely possible to pick up the same transcript on both strands (you might ask why the assembler hasn't picked up and joined the two otherwise identical sequences). The only thing you need the strand information for is translation into an AA sequence, but you might want to use blastx anyway.
If the contigs are not from the same transcript, this could be observed as well. Check the surrounding sequences of the CDS for similarity, it might be a duplication, misassembly, etc. Check the raw coverage and mismatches for both contigs.
Frame is mostly irrelevant because it is counted from the contig start to the position of the stop codon. The contig start depends on the random fragmentation of the mRNA and the subsequent assembly, so a difference of ±3 bases means nothing. The only cases where you can make use of the frame are translating the CDS (but you might use blastx anyway) or, if there are multiple genes on the contig, seeing their relative orientation.

So, the only case where you need to use this information is to translate the predicted CDS to AA, e.g. for tools that don't check all six reading frames.
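
To make the point about strand and frame concrete, here is a small Python sketch with invented toy sequences. It translates a contig in all six frames using the standard genetic code and reports which frame contains a given protein. The frame numbering (+1 to +3 on the contig as given, -1 to -3 counted from the start of the reverse complement) is an assumption chosen to mirror the usual BLAST-style labels; other tools may number frames differently.

```python
# Standard genetic code (NCBI translation table 1), in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Map frame label (+1..+3, -1..-3) to its translation."""
    frames = {}
    for strand, s in ((1, seq), (-1, revcomp(seq))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(s[offset:])
    return frames


def frames_containing(contig, protein):
    """Frame labels whose translation contains the protein sequence."""
    return [f for f, aa in six_frames(contig).items() if protein in aa]


# Toy example: one CDS, embedded in two differently "assembled" contigs.
cds = "ATGGCTAAACGTTGGTAA"               # M A K R W stop
protein = "MAKRW"
contig_a = "G" + cds + "GGA"             # one extra base before the CDS
contig_b = revcomp(cds + "CCTT")         # same CDS, opposite strand

print(frames_containing(contig_a, protein))  # [2]
print(frames_containing(contig_b, protein))  # [-1]
```

The same coding sequence shows up as frame +2 in one contig and frame -1 in the other purely because of where each contig happens to start and which strand it was assembled on. Nothing about the gene itself differs.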