Code Golf - bisulfite conversion

7

Entering edit mode

6.8 years ago

Chris Miller 22k

Been a while since we had a code golf challenge (shortest script wins, but expect to receive upvotes for cleverness)!

Simplifying slightly, bisulfite conversion means that Cs get converted to Ts, but some bases may be protected by methylgroups and thus remain unconverted. (This happens on the opposite strand too, so some complimentary Gs could get converted to As)

So, given an input sequence read (assume a [ACTGN] string of no more than 150 bp), output all reads that might result from partial or complete bisulfite conversion of that sequence.

EDIT: To be clear, C>T happens on one read, G>A happens on the other. Reads will not exist with both conversions! See this screenshot:

Small example: Input:

AACGCGAA

Output:

AATGTGAA
AATGCGAA
AACGTGAA
AACACAAA
AACACGAA
AACGCAAA
AACGCGAA

(the original string could be in there as well, since all bases could be protected!)

Happy Golfing!

bisulfite bsseq code-golf • 3.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 6.8 years ago by Chris Miller 22k

0

Entering edit mode

So that we can test our outputs, what is the output for ACTGN?

ADD REPLY • link 6.8 years ago by zx8754 12k

0

Entering edit mode

~~after I saw the other's answers, i'm not sure if we only need to change the 'C' in the sequence or the 'C' AND the 'G'.~~

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Both need to change since the methylation and read you are looking at could be from either strand.

ADD REPLY • link 6.8 years ago by GenoMax 152k

1

Entering edit mode

If I am understanding the problem correctly, you can't change C->T and G->A for the same read.

5'- ATGCCCATG -3'
3'- TACGGGTAC -5'

The only conversion chemically occurring is C->T, so there are 3 conversions possible for the upper strand, and two for the lower strand. ~~If my interpretation is correct, Pierre Lindenbaum and jrj.healey solutions are incomplete, and zx8754 solution is incorrect.~~

edit: my interpretation was right, but all answers have been corrected and updated.

ADD REPLY • link 6.8 years ago by h.mon 35k

0

Entering edit mode

This is correct, it's either C->T or G->A, but not both!

ADD REPLY • link 6.8 years ago by Chris Miller 22k

0

Entering edit mode

OK, R code corrected, gives same output as expected.

ADD REPLY • link 6.8 years ago by zx8754 12k

0

Entering edit mode

Updated this post with more details and example input/output. Sorry for the confusion!

ADD REPLY • link 6.8 years ago by Chris Miller 22k

5

Entering edit mode

6.8 years ago

Joe 22k

A python solution

Edit II: gained 8 bytes (now 108):

NB This doesn’t solve the problem completely correctly as present (was written before OP clarified).

from itertools import product as p
print(["".join(s) for s in p(*[i+{'C':'t','G':'a'}.get(i,'') for i in ""])])

Edit: a little enhancement at the cost of readability:

As a pseudo-oneliner, without counting the input string as bytes:

from itertools import product print(["".join(s) for s in product(*[i + {'C':'t'}.get(i, '') for i in ""])])

~~112 bytes, or 116, if the dictionary should read {'C':'t','G':'a'}.~~

Not 100% sure if this is correct, but I get the same output as Pierre:

from itertools import product
s = [i + {'C':'t'}.get(i, '') for i in "ATGCCCATG"]
print(["".join(sub) for sub in product(*s)])

127 bytes.

['ATGCCCATG', 'ATGCCtATG', 'ATGCtCATG', 'ATGCttATG', 'ATGtCCATG', 'ATGtCtATG', 'ATGttCATG', 'ATGtttATG']

If the G substitution is also needed (OP was unclear):

from itertools import product
s = [i + {'C':'t','G':'a'}.get(i, '') for i in "ATGCCCATG"]
print(["".join(sub) for sub in product(*s)])

I think I get the same as zx8754:

['ATGCCCATG', 'ATGCCCATa', 'ATGCCtATG', 'ATGCCtATa', 'ATGCtCATG', 'ATGCtCATa', 'ATGCttATG', 'ATGCttATa', 'ATGtCCATG', 'ATGtCCATa', 'ATGtCtATG', 'ATGtCtATa', 'ATGttCATG', 'ATGttCATa', 'ATGtttATG', 'ATGtttATa', 'ATaCCCATG', 'ATaCCCATa', 'ATaCCtATG', 'ATaCCtATa', 'ATaCtCATG', 'ATaCtCATa', 'ATaCttATG', 'ATaCttATa', 'ATatCCATG', 'ATatCCATa', 'ATatCtATG', 'ATatCtATa', 'ATattCATG', 'ATattCATa', 'ATatttATG', 'ATatttATa']

@Chris, it would be useful if you had some 100% correct example output?

ADD COMMENT • link 6.8 years ago by Joe 22k

0

Entering edit mode

Very interesting ! Could you detailed please :)

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.4k

2

Entering edit mode

Each iterable variation on the input string is stored, with a dictionary guiding the substitutions for each character in the string (for i in "ATG...").

According to the docs, the imported itertools module is functionally equivalent to nested for-loops (https://docs.python.org/3/library/itertools.html#itertools.product), allowing you to yield permutations of the input iterables, as 'cartesian products'. Its kind of similar to itertools.permutations, but that would give all the shuffled possibilities (though its following that train of thought that got me to this).

I confess I'm black-boxing this a little as it's heavily based upon https://stackoverflow.com/a/29184387/3691040 and I can't take all the credit. You'll be much better guided reading the docs than listening to me :P

ADD REPLY • link 6.8 years ago by Joe 22k

0

Entering edit mode

Yeah the product(*s) is quite murky to me. Anyhow it's really good coding !

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.4k

0

Entering edit mode

The (*s) syntax is just telling the product function that it will receive a variable number of args, as the number of iterables returned will depend on the sequence and its length. The inner workings of product I don't have a very strong feeling for, as I suspect it required better linear algebra than I have haha..

ADD REPLY • link 6.8 years ago by Joe 22k

4

Entering edit mode

6.8 years ago

Pierre Lindenbaum 166k

In C :

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <ctype.h>

	static void recursive(
	const char* src,
	char* dna,
	const unsigned int pos,
	const unsigned int len,
	int side
	)
	{
	unsigned int i=0UL;
	if(pos==len)
	{
	fputs(dna,stdout);
	fputc('\n',stdout);
	return;
	}

	switch (src[pos]) {
	case 'A':case 'a':
	case 'T':case 't':
	case 'N':case 'n':
	dna[pos]=src[pos];
	recursive(src,dna,pos+1,len,side);
	break;
	case 'G':case 'g':
	dna[pos]='G';
	recursive(src,dna,pos+1,len,side);
	if(side==0) {
	dna[pos]='A';
	recursive(src,dna,pos+1,len,side);
	}
	break;
	case 'C':case 'c':
	dna[pos]='C';
	recursive(src,dna,pos+1,len,side);
	if(side==1) {
	dna[pos]='T';
	recursive(src,dna,pos+1,len,side);
	}
	break;
	default:fprintf(stderr,"bad base %c\n",src[pos]);abort();
	}
	}

	int main(int argc,char** argv)
	{
	size_t len;
	char* src;
	char* dna;
	if(argc!=2)
	{
	fprintf(stderr,"Usage %s DNA\n",argv[0]);
	return EXIT_FAILURE;
	}
	src = argv[1];
	len = strlen(src);
	if(len==0UL)
	{
	fprintf(stderr,"Bad DNA\n");
	return EXIT_FAILURE;
	}
	dna=(char)malloc(sizeof(char)(len+1));
	if(dna==0)
	{
	fprintf(stderr,"Out of memory\n");
	return EXIT_FAILURE;
	}
	strcpy(dna,src);
	recursive(src,dna,0,len,0);
	recursive(src,dna,0,len,1);
	free(dna);
	return EXIT_SUCCESS;
	}

view raw biostar341051.c hosted with ❤ by GitHub

compilation:

gcc -O3 -o biostar341051 biostar341051.c

usage:

$ ./biostar341051 ATGCCCATG
AACGCGAA
AACGCAAA
AACACGAA
AACACAAA
AACGCGAA
AACGTGAA
AATGCGAA
AATGTGAA

Edit: fixed after I saw your comments about 'G'->A

Edit2 : fixed after Chris's comment.

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 166k

4

Entering edit mode

6.8 years ago

zx8754 12k

Using R, 212 bytes:

Note: I am converting C > t and G > a so it is easier to see the changes in the output

The function with whitespaces:

foo <- function(s){
  x = strsplit(s, "")[[1]]
  c(s, unlist(
    lapply(c("C", "G"), function(i){
      y = which(x == i)    
      sapply(seq(y), function(j) combn(y, j, function(k){
        z = x
        z[ k ] = setNames(c("t", "a"), c("C", "G"))[ z[ k ] ]
        paste(z, collapse = "")}))
    })))
  }

Usage:

foo("AACGCGAA")
# [1] "AACGCGAA" "AAtGCGAA" "AACGtGAA" "AAtGtGAA" "AACaCGAA" "AACGCaAA" "AACaCaAA"

ADD COMMENT • link 6.8 years ago by zx8754 12k

2

Entering edit mode

6.8 years ago

Alex Reynolds 36k

EDIT I may have a correct (Python-based) answer. Please see edited code below.

~~In addition to being shorter, jrj.healey's answer is also a bit more than twice as fast as building a powerset of substitution indices and reducing:~~

from functools import reduce
from itertools import product as p
import timeit

e = 'AACGCGAA'
m = {'C':'t','G':'a'}

def f(k,s,t):
    for i in s:
        t[i] = ord(m[k])
    return t.decode('ascii')

def a1():
    return [i for s in [[f(k,x,bytearray(e,'ascii')) for x in [reduce(lambda z,x: z + [y + [x] for y in z], s, [[]]) for s in [[i for i,v in enumerate(e) if v == k] for k in m.keys()]][i] if x != []] for i,k in enumerate(m.keys())] + [[e]] for i in s]

def a2():
    return ["".join(s) for s in p(*[i+{'C':'t','G':'a'}.get(i,'') for i in e])]

print("a1 ->", timeit.timeit(stmt=a1, number=1000))
print(a1)

print("a2 ->", timeit.timeit(stmt=a2, number=1000))
print(a2)

"""
$ python test341051.py
a1 -> 0.015046401007566601
['AAtGCGAA', 'AACGtGAA', 'AAtGtGAA', 'AACaCGAA', 'AACGCaAA', 'AACaCaAA', 'AACGCGAA']
a2 -> 0.006538485002238303
['AACGCGAA', 'AACGCaAA', 'AACGtGAA', 'AACGtaAA', 'AACaCGAA', 'AACaCaAA', 'AACatGAA', 'AACataAA', 'AAtGCGAA', 'AAtGCaAA', 'AAtGtGAA', 'AAtGtaAA', 'AAtaCGAA', 'AAtaCaAA', 'AAtatGAA', 'AAtataAA']
"""

For code golf purposes, the minified version uses ~~421 characters~~ 310 bytes. Might be worth comparing with R solution on time or memory, to see how it scales.

from functools import reduce
e='AACGCGAA'
m={'C':'t','G':'a'}
def f(k,s,t):
 for i in s:
  t[i]=ord(m[k])
 return t.decode('ascii')
print([i for s in[[f(k,x,bytearray(e,'ascii'))for x in[reduce(lambda z,x:z+[y+[x]for y in z],s,[[]])for s in[[i for i,v in enumerate(e)if v==k]for k in m.keys()]][i]if x!=[]]for i,k in enumerate(m.keys())]+[[e]]for i in s])

More positive controls may be useful.

ADD COMMENT • link 6.8 years ago by Alex Reynolds 36k

2

Entering edit mode

Forget efficiency :)

shortest script wins

Use http://bytesizematters.com/ and tick "Ignore whitespace".

ADD REPLY • link 6.8 years ago by zx8754 12k

0

Entering edit mode

haha okay 310 bytes, then!

ADD REPLY • link 6.8 years ago by Alex Reynolds 36k

0

Entering edit mode

mine still needs some work now Chris has clarified, it actually substitutes too many things :/

ADD REPLY • link 6.8 years ago by Joe 22k

0

Entering edit mode

Hmm, mine will work on each key of the substitution map m on its own, one at a time — so either C or G is substituted, but not both — so I think mine should give correct output. It would be useful to have some positive examples though, just to test this out.

ADD REPLY • link 6.8 years ago by Alex Reynolds 36k

0

Entering edit mode

When I edited the post, I added a small example!

ADD REPLY • link 6.8 years ago by Chris Miller 22k

0

Entering edit mode

Okay, I made some edits and re-ran things with output. I think mine matches your example. Some other input or edge cases may be useful to test.

ADD REPLY • link 6.8 years ago by Alex Reynolds 36k

Login before adding your answer.