6.5 years ago
shira.zaltsman
Dear all,
I have 50 full genomes of a bacterial species. I would like to find an important motif that all of them have in common using MEME, but each sequence is 4,000,000 bp long. Let's say I want to compare 12 sequences: how would you suggest cutting the sequences so that it runs in a reasonable time? What window size should I choose for comparing those 12 sequences of 4 million bp each?
Could you help me, please? Thank you, Shira.
Using AWK scripts, you could identify every motif of a given length (e.g. every 12-mer) that occurs in each genome and store the motifs and their frequencies in an indexed array, with one indexed array per genome. Each array would have the motif as the key and its frequency as the value. After that, it would be a matter of comparing the frequency of each motif across the arrays.
A similar idea would be to first generate all possible 12-mer motifs of ATGC and then count their frequencies in each genome. Then, at least, you'd have the same number of motifs (and in the same order) in each indexed array.
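A sketch of this second variant, again in Python as a stand-in for AWK, with hypothetical names `all_kmers` and `count_vector`. Every genome yields a count list of the same length and in the same order, so the lists can be compared position by position. Note that the number of motifs grows as 4^k (about 16.8 million for 12-mers), so the toy example uses k = 2.

```python
from itertools import product

def all_kmers(k, alphabet="ATGC"):
    """Enumerate all 4**k possible k-mers in a fixed order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def count_vector(sequence, kmers):
    """Count each listed k-mer in the sequence, in the order of `kmers`."""
    k = len(kmers[0])
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        pos = index.get(sequence[i:i + k])
        # Windows with ambiguous bases (e.g. N) are not in the index.
        if pos is not None:
            counts[pos] += 1
    return counts

kmers = all_kmers(2)
# Toy sequences standing in for whole genomes.
genomes = ["ATGCATGCAA", "TTATGCATGG"]
vectors = [count_vector(seq, kmers) for seq in genomes]
for kmer, per_genome in zip(kmers, zip(*vectors)):
    print(kmer, per_genome)
```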
This would actually be very rapid. Working with a 3.5-megabase bacterial genome, I can identify all motifs of up to 20 bases in length in a matter of seconds.
Does your solution use MEME? I would be grateful if you could point me to an example of the algorithm you suggested, because it is unfamiliar to me and I did not really understand how it works. Thank you for your help.
No, unfortunately my solution is my own code, but apparently it's already more efficient than MEME for large files. It even works fine on the human genome, processing it for k-mers in minutes. I have wanted to release it as a program but have had no time.
I posted my suggestion as a comment because it does not directly answer your question but rather offers a different idea. It may prove difficult to help you implement my code remotely. I had hoped that, on the off chance, you were familiar with AWK indexed arrays.