Let's say I have the string
AACAAAACAAAAAGTTGGGTAGTTTGAGAAC
and I want to scramble it for statistical testing purposes, but keep the same nucleotides and dinucleotides. Is there a script/library that does this?
Thanks.
Let's say I have the string
AACAAAACAAAAAGTTGGGTAGTTTGAGAAC
and I want to scramble it for statistical testing purposes, but keep the same nucleotides and dinucleotides. Is there a script/library that does this?
Thanks.
Python's random module has a shuffle method that allows you to shuffle elements of a list. You can do something like this:
import random
#shuffle works on lists, you have to convert your string into a list object
seq = list('AACAAAACAAAAAGTTGGGTAGTTTGAGAAC')
random.shuffle(seq)
print ''.join(seq)
Probably not the most efficient way to do it if you have a very long sequence.
To do it by di-nucleotides, just break your string up into a di-nucleotide list:
seq = 'AACAAAACAAAAAGTTGGGTAGTTTGAGAAC'
seq = [seq[x:x+2] for x in range(0,len(seq),2)]
random.shuffle(seq)
print ''.join(seq)
Note that for odd length sequences, the last element will be a single nucleotide. If you want to remove that, check for odd length and pop off the last element of the list.
edit**
As brent and hengli commented, this actually doesn't preserve all the di-nucleotide frequency. A more sophisticated method will be needed to do this.
I guess you can treat this problem similar to a sequence assembly. Generate k-mers with k=2 for the sequence and make a redundant debruijn graph. All possible complete traversals of this graph would be the possible shuffled sequences. Then choose a random one out of the possible traversals.
Getting complete traversals becomes a traveling salesman problem....
Brentp's solution further down is probably the best solution you can achieve without doing crazy algorithms.
EDIT: this will not have the exact di-nulceotide frequency, but the emission probability is based on the input sequence so it will be close
If you actually want to maintain the di-nucleotide frequency this will do it in python:
from collections import defaultdict, Counter
import random
def difreq(seq):
counts = defaultdict(lambda: defaultdict(int))
for a, b in zip(seq, seq[1:]):
counts[a][b] += 1
return dict((k, dict(v)) for k,v in counts.iteritems())
# <http://stackoverflow.com/questions/3679694/a-weighted-version-of-random-choice>
weighted_choice = lambda s : random.choice(sum(([v]*wt for v,wt in s),[]))
def shuffle_difreq(seq):
freqs = difreq(seq)
# get the first base by the total frequency across the sequence
shuff_seq = [weighted_choice(Counter(seq).items())]
for i in range(1, len(seq)):
# each following base is based of the frequency of the previous base
# and their co-occurence in the original sequence.
shuff_seq.append(weighted_choice(freqs[shuff_seq[-1]].items()))
return "".join(shuff_seq)
if __name__ == "__main__":
seq = 'AACAAAACAAAAAGTTGGGTAGTTTGAGAAC' * 20
print difreq(seq)
shuff_seq = shuffle_difreq(seq)
print difreq(shuff_seq)
the "emission probability" of any given base depends on its frequency with the previous base in the original sequence. This example will print out the coocurrence frequencies so that you can verify that is the case.
This algorithm has a little hiccup. Try running
shuffle_difreq("TTTTTTAATTATGTTGTATTTTAGGTTTC") a couple of times and you will get an error. It seems to happen when a nucleotide which is only included once in the string is in the first or last position. But it does not happen all the time.
If the last base in the new sequence doesnt have a transition probability (like 'C') in your example, then it will cause this error. You can get around it with this function:
def shuffle_difreq(seq):
freqs = difreq(seq)
# get the first base by the total frequency across the sequence
shuff_seq = [None]
while not shuff_seq[0] in freqs:
shuff_seq = [weighted_choice(Counter(seq).items())]
while len(shuff_seq) < len(seq):
# each following base is based of the frequency of the previous base
# and their co-occurence in the original sequence.
try:
shuff_seq.append(weighted_choice(freqs[shuff_seq[-1]].items()))
except KeyError:
shuff_seq.pop()
assert len(shuff_seq) == len(seq)
return "".join(shuff_seq)
This won't give you a shuffle sequence with the exact same di-nucleotide frequency right? It will give a shuffle sequence with something very close the di-nucleotide frequency though. But if the sequence is short, the shuffled and original sequence will diverge more relatively.
Thanks, this is the correct answer. It seems to have been a more knotty problem than I had thought.
Mathematically speaking, is it possible to say how "bad" this algorithm will be for reads with 26-31 nucleotides in length? I see that you in your example use a string 20 times longer than mine.
If this is the type of sequence you want to shuffle, a one-liner will do:
$ perl -MList::Util -le 'print(join "", List::Util::shuffle(split //, "AACAAAACAAAAAGTTGGGTAGTTTGAGAAC"))'
ATGGCAAAAGAGGATGCTTTACAGAATAAAA
You can put that in a script to do whatever calculations you need. If you are working with longer sequences, I'd suggest using Pierre's solution or an existing code library (IIRC, genometools has some sequence shuffling programs and a python API). Note this just shuffles the sequence irrespective of the dinucleotides.
This keeps the same dinucleotides:
$ perl -MList::Util -le 'print(join "", List::Util::shuffle(unpack("(A2)*", "AACAAAACAAAAAGTTGGGTAGTTTGAGAAC")))'
AATTAAAGTGCGTAACATTAGACAGGGAAAA
ETA: This solution will be as fast as any C/C++ solution for shuffling (shuffle<> is written in C), but join will use memory and likely cause the whole program to be slower than one written in a compiled language. For sequences of this length though, I don't think it will make a noticeable difference (in terms of speed) how you decide to solve this problem.
ETA2: After following the discussion above, it appears that the second approach will not do what you want in terms of dinucleotide frequencies. So, a fast solution with the wrong answer is still no help ;-). It will be interesting to see how someone works this out.
in c++. I used the poly-XXX instead of the di-nucleotides. Feel free to break the 'j' loop when j-i>2.
#include <iostream>
#include <cstdlib>
#include <string>
#include <vector>
#include <ctime>
#include <algorithm>
using namespace std;
int main(int argc,char** argv)
{
if(argc!=2) return -1;
srand(time(0));
size_t i=0;
vector<string*> nucl;
while(argv[1][i]!=0)
{
size_t j=i+1;
while(argv[1][j]!=0 && argv[1][i]==argv[1][j])
{
++j;
}
nucl.push_back(new string(&argv[1][i],j-i));
i=j;
}
for(;;)
{
random_shuffle(nucl.begin(),nucl.end());
for(size_t i=0;i< nucl.size();++i) cout << *nucl[i];
cout << endl;
}
return 0;
}
$ g++ -Wall -O3 code.cpp && ./a.out AACAAAACAAAAAGTTGGGTAGTTTGAGAAC | head
CAAAAGTTTAAAAAAAGCCAAGTTTGGGGAA
GGAAAAAGCAAAAACTTTTTTCAAAGGGGAA
GGGGAAAACAATTGTTTAATAAAAAAACCGG
AGGGACTTTAAAAATTGAAAAGGCGAAAACT
AATTTGTAGTTACAAAAAAGGGGGAAAAACC
CGACGTTATTTTAACGAAAAGGGGAAAAAAA
AAAAACGTTTGGAACTTTGGGAAAAAAACAG
CAGAAAAGGGAGCAAAAAGTTTAAAATTTGC
AATATTTCTTAGGAAGCGGGGAAAAACAAAA
GGGCAGTTCAAAAAGTTTCAAAAGAAAATGA
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Keeping the exact di-nucleotide frequency is interesting. It can be reduced to a kind of graph traversal problem. In the directed graph, the vertices are A/C/T/G with an arc between each of them. Each arc is associated with a count c(a,b), the count of corresponding di-nucleotide (a,b) in the original sequence. The goal is to find a path using arc (a,b) exactly c(a,b) times. I do not know if there is an efficient algorithm to enumerate all the solutions. A naive NP algorithm is to do a simple backtracking. I think all the answers so far do not guarantee to keep the exact frequency.
PS: Another reduction: break the original sequence into 2-mer "read" and the goal is to assemble a new "contig" using each "read" exactly once. A NP-hard Hamilton problem. I think there are no polynomial solutions anyway.
Yeah you are right. The listed solutions will not take into account all the di-nucleotides since spliting the sequence by every 2 nucleotides misses the di-nucleotides "between" each splitted di-nucleotide. IE. AGTCGA => AG, TC, GA. But we are missing GT, CG di-nucleotides.
I like your graph formulation but maybe we can modify it to allow an efficient algorithm. Instead of making a complete graph with edge counts we may create a multigraph with c directed edges between (a,b) instead. Solutions correspond to eulerian trails of this graph. The graph must have at least one such trail.
Yeah, you are of course right. My brain was dysfunctioning... Then the question is then how to modify Hierholzer's algorithm to randomly sample one Eulerian path in linear time. Enumerating all solutions is still non-polynomial.