Dear Michael,
Thank you very, very much for such a nice reply :) I am still learning, and your information was very useful for me.
First: yes, my data are non-strand-specific, and I did the assembly with Trinity (I also tried SOAPTrans before, but it was inefficient with my data).
After the assembly I ran tblastn with seed sequences as queries for the proteins I am interested in (well known in lineages other than the organism I analyzed). Next I took the contigs encoding these proteins, translated them with ExPASy (it shows all six frames, +/-3), chose the best alignment, and re-ran the protein sequences against the transcriptome. Then I saw that for each protein I have two matches (in two different frames).
You wrote:
If the contigs are not from the same transcript, this could be observed as well. Check the surrounding sequences of the CDS for similarity, it might be a duplication, misassembly, etc. Check the raw coverage and mismatches for both contigs.
Do you mean, for example, checking in IGV the sequences before and after the regions I identified? I really don't know how to check this, or what program to use to see how the scaffold is arranged.
Once again, thank you Michael. Your answer was very useful and explained simply, the way I like it, so I could understand it :)
Cheers
DB
written 10.6 years ago by milady81 • updated 4.9 years ago by Ram
Hi DB, what I meant with that somewhat convoluted sentence was: "Try to find out if the contigs are from the same transcript."
Run blastn with the two contigs against each other. Is there sequence similarity in the sequences surrounding the coding sequence as well, or only within the coding sequence?
Align your original reads to the assembled contigs and view the result in IGV. Are there reads bridging (by sequence or by matching pairs) the boundary between the predicted coding sequence and the flanking sequence?
If you want to be sure, run a PCR (on DNA) and an RT-PCR using primers for each transcript.
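As a rough illustration of the first check, here is a minimal Python sketch that compares two contigs inside and outside the coding sequence. It is a crude k-mer stand-in for a real blastn run, not a replacement for it. The function names, the k-mer size of 11, and the 0-based half-open CDS coordinates are all assumptions made for this example.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T; other letters pass through)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def canonical_kmers(seq, k=11):
    """Strand-independent k-mer set: each k-mer is stored as the
    lexicographically smaller of itself and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:  # skip k-mers spanning an N (used below as a separator)
            continue
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def shared_fraction(a, b, k=11):
    """Shared canonical k-mers, relative to the smaller of the two sets."""
    ka, kb = canonical_kmers(a, k), canonical_kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def compare_contigs(contig_a, cds_a, contig_b, cds_b, k=11):
    """cds_a and cds_b are (start, end) positions, 0-based and half-open,
    given on each contig as it was assembled (assumed convention)."""
    (sa, ea), (sb, eb) = cds_a, cds_b
    # Join the two flanks with an N so that no k-mer spans the removed CDS.
    flanks_a = contig_a[:sa] + "N" + contig_a[ea:]
    flanks_b = contig_b[:sb] + "N" + contig_b[eb:]
    return {
        "cds": shared_fraction(contig_a[sa:ea], contig_b[sb:eb], k),
        "flanks": shared_fraction(flanks_a, flanks_b, k),
    }


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    cds = "ATGGCTAAGCTTGGATCCGAATTCCTGCAGTAA"
    contig_a = "AC" * 10 + cds + "AG" * 10
    contig_b = reverse_complement(contig_a)  # same transcript, other strand
    print(compare_contigs(contig_a, (20, 20 + len(cds)),
                          contig_b, (20, 20 + len(cds))))
```

A high value for both "cds" and "flanks" suggests the same transcript assembled twice (possibly on opposite strands). A high "cds" value with a low "flanks" value points towards a duplication or a misassembly, which is when the read alignment in IGV becomes the more informative check.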
I think that is to be expected. If I understand correctly, you have non-strand-specific RNA-seq reads (please verify that this is correct). You assembled your reads into contigs (using which assembler?) and ran some sort of gene or ORF prediction (which?) on the assembled contigs.
Now you have found two gene predictions on different contigs with identical coding sequence, but on opposite strands and in different reading frames. That is totally fine, because strand and frame are mostly irrelevant in your case. Why:
Strand is mostly irrelevant because your protocol was (probably) not strand-specific, so it is entirely possible to pick up the same transcript on both strands (you might ask why the assembler hasn't picked up and joined the two otherwise identical sequences). The only thing you need the strand information for is translation into an AA sequence, but you might want to use blastx anyway.
If the contigs are not from the same transcript, this could be observed as well. Check the sequences surrounding the CDS for similarity; it might be a duplication, a misassembly, etc. Check the raw coverage and the mismatches for both contigs.
Frame is mostly irrelevant because it is counted from the contig start to the position of the stop codon. The contig start depends on the random fragmentation of the mRNA and on the subsequent assembly, so +/-3 bases mean nothing. The only cases where you can make use of the frame are translation of the CDS (but you might use blastx anyway) or, if there are multiple genes on the contig, seeing their relative orientation.
So, the only case where you need to use this information is to translate the predicted CDS to AA, e.g. for tools that don't check all six reading frames.
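To make the point about strand and frame concrete, here is a small Python sketch that translates a contig in all six frames and reports where a given protein turns up. The standard genetic code is used. The function names and the frame labels (+1 to +3 counted from the contig start, -1 to -3 counted from the start of the reverse complement) are assumptions for this example, and other tools may number the minus frames differently.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def translate(seq):
    """Translate from the first base; unknown codons become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(contig):
    """Translations of all six reading frames, keyed by frame label."""
    contig = contig.upper()
    rc = reverse_complement(contig)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(contig[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def find_protein(contig, protein):
    """Labels of the frames whose translation contains the protein."""
    return [name for name, aa in six_frames(contig).items() if protein in aa]


if __name__ == "__main__":
    # Toy CDS and flanks, invented purely for illustration.
    cds = "ATGGCTAAGCTTGGATCCGAATTCCTGCAGTAA"
    protein = translate(cds).rstrip("*")
    contig_1 = "GG" + cds + "ACGT"                          # CDS on the plus strand
    contig_2 = reverse_complement("C" + cds + "TTGACC")     # CDS on the minus strand
    print(find_protein(contig_1, protein))
    print(find_protein(contig_2, protein))
```

The same protein is recovered from both toy contigs, only under different frame labels, because the label depends on nothing more than where the contig happens to start and which strand was assembled.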