Hi!
I made a splice junction library so I can map the reads that bowtie could not map to the genome. It looks like this:
chr167000051+67091529 ccaccatgatggaaggattgaaaaaacgta 30
chr167091593+67098752 ggacactgattctacaggttcaccagatag 30
chr167098777+67101626 gatagagatggaattcagcccagcccacac 30
chr167101698+67105459 ggaaaaaaagtttcgaagaaaagcaatggg 30
chr167105516+67108492 gattgggaaagatataactcacctgagctg 30
chr167108547+67109226 ccgaggaacccggctctaccaaaggaaagc 30
#It's a CSV file
Now we want to make a negative control to find out whether the reads align to the library more often than expected by chance alone. To do that, we want to scramble the sequences in the second column of the generated CSV.
I'm learning Python and programming in general, and I wrote the script that generates the splice junction library myself. So you could help me with a Python tip for scrambling a string, or by pointing me to a tool that scrambles a column of a CSV (or the sequences of a multi-FASTA file).
Bash tips using AWK, sed, or whatever are also welcome.
Thanks a lot for your help!!!
Using random.shuffle I get:
chr167000051+67091529 cgacaaaagtaacggactttaaaaggactg 30
chr167091593+67098752 ggcaaactattggtctataaatagccccgg 30
chr167098777+67101626 accgtgctcaatgaagacaaccgcgtaacg 30
chr167101698+67105459 atggttacggaacaaaaagaaaggaaaggt 30
chr167105516+67108492 ttaataggagaagcacagctcttagatcgg 30
chr167108547+67109226 gaaggcggcaaaccctacaatgacgccgca 30
thanks!
My mistake. I always thought random.sample sampled with replacement, but (as I am glad to learn) it samples without replacement. http://docs.python.org/library/random.html
You need to use
"".join(shuffled_seq)
as Michael did, because sample returns a list.
random.sample doesn't retain the nucleotide frequency? Are you sure?
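For what it's worth, here is a minimal sketch of the random.sample approach discussed above, with a quick check of the nucleotide composition. The function name scramble and the optional rng argument are made up for illustration. Because the sample size equals the sequence length and sampling is without replacement, the result should be a permutation of the input, so the base counts stay the same.

```python
import random
from collections import Counter


def scramble(seq, rng=random):
    """Return seq with its characters in random order (hypothetical helper)."""
    # random.sample draws without replacement; with k == len(seq) the result
    # is a permutation of seq, returned as a list of single characters.
    shuffled_seq = rng.sample(seq, len(seq))
    # Join the list back into a string.
    return "".join(shuffled_seq)


original = "ccaccatgatggaaggattgaaaaaacgta"
shuffled = scramble(original)
print(shuffled)
# Same letters, same counts, different order.
print(Counter(original) == Counter(shuffled))  # True
```

If the output looks odd, it may be worth checking that the join step was applied, since printing the list directly shows brackets and quotes around every base.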
However, random.sample generates a weird output that isn't what I need, so I'll try random.shuffle :)
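A rough sketch of the random.shuffle route, applied to the second column of each line. The helper names, the column index, and the delimiter handling are assumptions on my part: the actual separator of the file is not visible in the post, so the default here splits on whitespace, and a real CSV would need delimiter="," (or the csv module).

```python
import random


def shuffle_sequence(seq):
    """Return a shuffled copy of seq (hypothetical helper)."""
    chars = list(seq)       # strings are immutable, so work on a list copy
    random.shuffle(chars)   # shuffles in place and returns None
    return "".join(chars)


def scramble_column(lines, column=1, delimiter=None):
    """Yield each line with the field at index `column` shuffled.

    delimiter=None splits on runs of whitespace (an assumption about the
    file layout); pass "," or "\t" for a real CSV or TSV file.
    """
    out_sep = " " if delimiter is None else delimiter
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Leave short or empty lines untouched.
        if len(fields) > column:
            fields[column] = shuffle_sequence(fields[column])
        yield out_sep.join(fields)


rows = [
    "chr167000051+67091529 ccaccatgatggaaggattgaaaaaacgta 30",
    "chr167091593+67098752 ggacactgattctacaggttcaccagatag 30",
]
for row in scramble_column(rows):
    print(row)
```

The same generator should also accept an open file object in place of the list, since it only iterates over lines. Calling random.seed with a fixed value first would make the negative control reproducible.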