Hi all.Does anybody know how to generate random DNA sequences (about 20) with equal base frequencies? (I want to generate this data for a test)
Hi all.Does anybody know how to generate random DNA sequences (about 20) with equal base frequencies? (I want to generate this data for a test)
A solution using Python:
import random
def random_dna_sequence(length):
return ''.join(random.choice('ACTG') for _ in range(length))
You want a DNA string with equal base probability. So the probability of each base appearing is 0.25
. With the following function you can check how much each DNA string deviates from this predicted probability:
def base_frequency(dna):
d = {}
for base in 'ATCG':
d[base] = dna.count(base)/float(len(dna))
return d
for _ in range(20):
dna = random_dna_sequence(100)
print dna, base_frequency(dna)
which would generate a result like:
AAGTGACGCCCGGTGCGAAAAACACGCGCCTCTCCGTAGTCATTCAGACT {'A': 0.26, 'C': 0.32, 'T': 0.18, 'G': 0.24}
AAGGATCTACTACCTCGTCTATTTGAACTACTGTAGTGCTACTAACTCAT {'A': 0.28, 'C': 0.24, 'T': 0.34, 'G': 0.14}
TCCACTTCTTGGTCCTGAACACCTGCAATCACCTCTTACATCGTGCGACG {'A': 0.2, 'C': 0.36, 'T': 0.28, 'G': 0.16}
AATCTCCGGTGTGTCCGCTACGGAGGTTAGGGCACTCCGTGGGAAAGCTC {'A': 0.18, 'C': 0.26, 'T': 0.22, 'G': 0.34}
GCGTAGTTCGCATTGATTAACATAGTGGCGACCATAGACTTCTATTATCG {'A': 0.26, 'C': 0.2, 'T': 0.32, 'G': 0.22}
AAGTGAACCTGGACTGGGTGGATCGTCTCCCTCGTCCGGTCCTTGGTAGC {'A': 0.14, 'C': 0.28, 'T': 0.26, 'G': 0.32}
ATGACGATGACGATCATCGTCAACGCGCGTCGCGCACACTGCATATCCAA {'A': 0.28, 'C': 0.32, 'T': 0.18, 'G': 0.22}
GTGCATACCGGTGCGCGCGTGCGCTAGGTATTGGAATGCTACGCTTAACC {'A': 0.18, 'C': 0.26, 'T': 0.24, 'G': 0.32}
GCCCGCGTGCCGCCAAGGGATGGGGAGAGTATTTTCGCCCCCTAAGTGCC {'A': 0.16, 'C': 0.32, 'T': 0.18, 'G': 0.34}
TCAAGATTCTCCTAAATATATAATGATCATCCGTTGTCATTCTGCGGACT {'A': 0.28, 'C': 0.22, 'T': 0.36, 'G': 0.14}
TGTTTTAGCCCTGTAGCCGGACTACGAAGTTTTAGGCGCCCAGATTAAGG {'A': 0.22, 'C': 0.22, 'T': 0.28, 'G': 0.28}
AGACGAGCTTTCAAGTTCTTGAATCACTACCTTTGACGTCGAGTGTAAGG {'A': 0.26, 'C': 0.2, 'T': 0.3, 'G': 0.24}
TCGCATTGTAAATAGGAACCTGAAACCTGCCAAGGAGATACAGTCTAAAT {'A': 0.38, 'C': 0.2, 'T': 0.22, 'G': 0.2}
CATCCGTGTGGTAACAGTTAATGCCGGGCTCACCCTCAGGTGTGAAGGAT {'A': 0.22, 'C': 0.24, 'T': 0.24, 'G': 0.3}
ACCAAGACATACCTTAAGGCCCACGCGTACAAGTCACGCTCTCAATACGG {'A': 0.32, 'C': 0.34, 'T': 0.16, 'G': 0.18}
CGTCGTTGGTATTCAGAAAACGCTAGCACATATGGTGCCCAGTCAAAGGA {'A': 0.3, 'C': 0.22, 'T': 0.22, 'G': 0.26}
CGTCATTGCACCAAGTGTGGTACTTTGGGGACGTGAGGTAACAATCCCTG {'A': 0.22, 'C': 0.22, 'T': 0.26, 'G': 0.3}
TGGTCCCTGTTTCTCCATTCCGCGTCCATCGTGCGTTCGTCCTTTAAAGT {'A': 0.1, 'C': 0.32, 'T': 0.38, 'G': 0.2}
AATTCACTCTTTTAACGATGGAAACGGGCGTTTGTAGTGTGCCACTAACC {'A': 0.26, 'C': 0.22, 'T': 0.3, 'G': 0.22}
CCTTGTATACCCCACATGAAGAATGGGCCTGACATCAATAATCTTTAGAT {'A': 0.32, 'C': 0.24, 'T': 0.28, 'G': 0.16}
A simple way using R
# define which bases will make up your sequences
bases <- c(rep('A', 5), rep('C',5), rep('G',5), rep('T',5))
# set how many sequences you want to produce
numOfSeqs <- 10
# initialize empty object
seqs <- rep (NA, numOfSeqs)
# populate the object by shuffling and joining your bases
for (i in 1:numOfSeqs){
seqs[i] <- paste(sample(bases, length(bases)), collapse = '')
}
Then you can do what you want with object seq.
Clearly, if you need to produce a very large number of sequences, you have to find a way to print them to file or it will fill the memory.
One option is to use PhyloSim
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks alot.Anyway,is there any software to do this without writing a code for it?