Been a while since we had a code golf challenge (shortest script wins, but expect to receive upvotes for cleverness)!
Simplifying slightly, bisulfite conversion means that Cs get converted to Ts, but some bases may be protected by methyl groups and thus remain unconverted. (This happens on the opposite strand too, so some complementary Gs could get converted to As.)
So, given an input sequence read (assume an [ACTGN] string of no more than 150 bp), output all reads that might result from partial or complete bisulfite conversion of that sequence.
EDIT: To be clear, C>T happens on one read, G>A happens on the other. Reads will not exist with both conversions!
Small example:
Input:
AACGCGAA
Output:
AATGTGAA
AATGCGAA
AACGTGAA
AACACAAA
AACACGAA
AACGCAAA
AACGCGAA
(the original string could be in there as well, since all bases could be protected!)
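For anyone who wants a reference to check a golfed answer against, here is a minimal, un-golfed Python sketch of the rule as stated above. It assumes upper-case input, and the function name `bisulfite_reads` is made up for illustration. It treats C>T and G>A as two separate passes, so no output read mixes both conversions.

```python
from itertools import product

def bisulfite_reads(read):
    """Return the set of reads reachable by C>T alone or by G>A alone."""
    results = set()
    for src, dst in (("C", "T"), ("G", "A")):
        # Each occurrence of src is either protected (kept) or converted to dst;
        # every other base has a single choice.
        choices = [(base, dst) if base == src else (base,) for base in read]
        results.update("".join(combo) for combo in product(*choices))
    return results

# Print the reads for the small example, sorted for stable output.
for converted in sorted(bisulfite_reads("AACGCGAA")):
    print(converted)
```

For the small example this yields the same seven reads listed above, including the unconverted original. Note that the number of reads doubles with every C (or G), so a real 150 bp read with many Cs would be far too large to enumerate this way.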
Happy Golfing!
So that we can test our outputs, what is the output for
ACTGN
?
After I saw the others' answers, I'm not sure whether we only need to change the 'C' in the sequence, or the 'C' AND the 'G'.
Both need to change, since the methylation and the read you are looking at could be from either strand.
If I am understanding the problem correctly, you can't change C->T and G->A for the same read.
The only conversion chemically occurring is C->T, so there are 3 conversions possible for the upper strand, and two for the lower strand.
If my interpretation is correct, Pierre Lindenbaum's and jrj.healey's solutions are incomplete, and zx8754's solution is incorrect.
Edit: my interpretation was right, but all answers have been corrected and updated.
This is correct, it's either C->T or G->A, but not both!
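A quick sanity check that follows from this rule: with c Cs and g Gs in the read, there are 2^c reads from C>T and 2^g from G>A, and the two sets share only the unconverted original, so the total is 2^c + 2^g - 1. A small Python sketch (the helper name `count_bisulfite_reads` is hypothetical, and upper-case input is assumed):

```python
def count_bisulfite_reads(read):
    """Count distinct reads from C>T alone or G>A alone, original included."""
    c = read.count("C")
    g = read.count("G")
    # The unconverted read appears in both sets, so subtract it once.
    return 2 ** c + 2 ** g - 1

print(count_bisulfite_reads("AACGCGAA"))  # 2 Cs and 2 Gs
```

For `AACGCGAA` this gives 7, matching the example output in the post.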
OK, R code corrected, gives same output as expected.
Updated this post with more details and example input/output. Sorry for the confusion!