forgi (github page) is a Python package for working with RNA secondary structures. I've used it in the past for tasks like this.
Here's an example. By the way, might want to check your expected output against the example sequence you provided.
import forgi.graph.bulge_graph as fgb
bg = fgb.BulgeGraph()
bg.from_fasta(
"""\
>demo
UGCCCUAGUACGAGAGGACCGGAGUG
...(((........))).........
""")
for d in bg.defines:
print d, bg.get_define_seq_str(d)
Here's the output. Notice it keeps the two sides of the stem separate, and the 5' and 3' "stems" are listed separately.
f1 ['UGC']
h0 ['AGUACGAG']
s0 ['CCU', 'AGG']
t1 ['ACCGGAGUG']
Edit: forna by the same author is useful for visualizing structures and is helpful for debugging and troubleshooting this sort of analysis.
Edit 2: In response to the comment asking about providing a string variable to from_fasta
:
It's not documented, but after playing around it looks like this method requires 1) the record must start with ">" -- not even a newline is allowed, and 2) a trailing newline at the end of the structure line. So if you have your strings sequence
and structure
, you can build a string that from_fasta
will accept like this:
text = '\n'.join(['> ', sequence, structure, '\n'])
bg.from_fasta(text)
or if you want to include a sequence ID:
text = '\n'.join(['\n', '> ' + seqid, sequence, structure, '\n'])
bg.from_fasta(text)
So let me clarify:
Considering your example, you are saying that
each
.
= nucleotide on loop(
and)
= nucleotide on stemRight?
It's the output of RNAfold.
The sequence is aligned with structure. for example, for the dot at position one you have to check the character at corresponding position in string. I am saving them both as strings and their length is same.
Here is my code:
yes this is exactly what i am saying