8.6 years ago
Kevin_Smith
In the sequence file, each line contains a single DNA sequence. I would like to try just one round of a greedy algorithm, starting from the first sequence and the first position. The goal is to find a motif of length 7 in the DNA sequences given in the file.
The input for the program should be the sequence file name.
The output should include:
- the sites or kmers (each of them is a 7mer sequence from a sequence)
- the PPM (position probability matrix) for ATCG
- and the total information content of the PPM (for 7-mers).
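The last two outputs can be sketched as below. This is only a minimal illustration under stated assumptions: the function names `build_ppm` and `information_content` are hypothetical, a pseudocount of 1 is added to every base count, and the information content is measured in bits against a uniform background of 0.25 per base. With that pseudocount, seven sites give values such as 1/11 ≈ 0.091 and 8/11 ≈ 0.727, which are the kind of numbers seen in a PPM like the one in the example. Under this definition a 7-column PPM cannot exceed 14 bits, so a larger total would imply a different formula or background.

```python
import math

ALPHABET = "ATCG"  # row order used in the post


def build_ppm(sites, pseudocount=1.0):
    """Return {base: [probability per position]} for equal-length sites.

    The pseudocount is an assumption, not something the post specifies.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    ppm = {base: [] for base in ALPHABET}
    for pos in range(length):
        column = [site[pos] for site in sites]
        for base in ALPHABET:
            ppm[base].append((column.count(base) + pseudocount) / total)
    return ppm


def information_content(ppm, background=0.25):
    """Sum of p * log2(p / background) over all bases and positions (bits)."""
    total = 0.0
    for base in ALPHABET:
        for p in ppm[base]:
            if p > 0:  # skip zero entries, which contribute nothing
                total += p * math.log2(p / background)
    return total
```

For instance, `build_ppm(["AAAAAAA"] * 7)` puts 8/11 in every position of the `A` row and 1/11 in the other rows.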
For example:
Input
TCTGAGCTTGCGTTATTTTTAGACC
GTTTGACGGGAACCCGACGCCTATA
Output
kmers:
TTCCT TTGCG
PPM:
A 0.091 0.091 0.091 0.091 0.091
T 0.727 0.727 0.091 0.091 0.545
total information content: 22.47
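One possible reading of "a single greedy round" is sketched here, not as a definitive implementation. It assumes the seed is the first 7-mer of the first sequence, and that each later sequence contributes the 7-mer that maximizes the total information content of the PPM built from the sites chosen so far plus that candidate. The helper names, the pseudocount of 1, and the uniform 0.25 background are all assumptions; sequences shorter than 7 bases are skipped.

```python
import math
import sys

ALPHABET = "ATCG"
K = 7  # motif length from the post


def read_sequences(path):
    """One DNA sequence per line; blank lines are ignored."""
    with open(path) as handle:
        return [line.strip().upper() for line in handle if line.strip()]


def ppm_from_sites(sites, pseudocount=1.0):
    """Position probability matrix as {base: [p per position]} (pseudocount assumed)."""
    total = len(sites) + pseudocount * len(ALPHABET)
    return {
        base: [(sum(s[i] == base for s in sites) + pseudocount) / total
               for i in range(len(sites[0]))]
        for base in ALPHABET
    }


def total_ic(ppm, background=0.25):
    """Total information content in bits against a uniform background."""
    return sum(p * math.log2(p / background)
               for row in ppm.values() for p in row if p > 0)


def greedy_round(sequences, k=K):
    """Seed with the first k-mer of the first sequence, then pick one site per sequence."""
    sequences = [s for s in sequences if len(s) >= k]
    sites = [sequences[0][:k]]
    for seq in sequences[1:]:
        candidates = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        # Keep the candidate that gives the highest-scoring PPM so far.
        best = max(candidates, key=lambda c: total_ic(ppm_from_sites(sites + [c])))
        sites.append(best)
    return sites


def main(path):
    sites = greedy_round(read_sequences(path))
    ppm = ppm_from_sites(sites)
    print("kmers:")
    print(" ".join(sites))
    print("PPM:")
    for base in ALPHABET:
        print(base, " ".join(f"{p:.3f}" for p in ppm[base]))
    print(f"total information content: {total_ic(ppm):.2f}")


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a hypothetical name such as `greedy_motif.py`, it would be run as `python greedy_motif.py sequences.txt`. A full greedy search would usually try every starting position in the first sequence; this sketch does only the single round described above.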
Can anyone help me with a Python script? Thank you very much!