Hello,
I have 1000 motifs (octamers) and I want to determine the total number of occurrences of each motif in the whole genome of Arabidopsis thaliana. Can you help me, please? Do you know of a tool to do this, or what else can I do? I have tried the RSAT tools, but the job takes a very long time (I submitted my work 2 days ago and have not received the results yet).
Thank you very much.
I am no biologist, but I suppose octamer motifs are just 8 particular bases (e.g. ATTCGTGT). You can download the whole genome (it's about 135 Mb) and then run a simple program that counts the occurrences of each of your motifs (this is like counting k-mers for k=8).
You could do it like this in C:
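#include <stdlib.h>
#include <stdio.h>

int main(void){
    const char myOctamer[] = "ATTCGTGT"; // Your octamer here
    // Standard fopen(); fopen64() is a non-portable glibc extension
    FILE *genomeFasta = fopen("file.txt", "r");
    if(genomeFasta == NULL){
        perror("file.txt");
        return EXIT_FAILURE;
    }
    int c; // int rather than char, so EOF can be told apart from data
    int totalFound = 0, current = 0;
    // Test the result of fgetc() instead of feof(), which only becomes true after a failed read
    while((c = fgetc(genomeFasta)) != EOF){
        if(c != '\n'){ // ignore line breaks so matches can span lines
            if(c == myOctamer[current]){
                current++;
                if(current == 8){ // all 8 bases matched
                    current = 0;
                    totalFound++;
                }
            }else{
                current = 0; // mismatch: discard the partial match and start over
            }
        }
    }
    fclose(genomeFasta);
    fprintf(stdout, "Found %d occurrences.\n", totalFound);
    return 0;
}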
This works for only one octamer, but generalizing it to n octamers would not be difficult.
Note that this will not count overlapping occurrences. And in case you actually use it, I recommend compiling with -D_FILE_OFFSET_BITS=64 to be able to handle large files (over 2 GB).
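For the generalization to n octamers, here is a minimal sketch of one possible approach, not a tested tool. It assumes each base is packed into 2 bits, so every octamer maps to one of 4^8 = 65536 table slots, and a single pass over the file counts all of them, overlaps included. The names `base_code`, `encode_octamer` and `count_octamers` are made up for this illustration. It also assumes a FASTA-style file: lines starting with '>' are skipped, and any character other than A, C, G or T (such as N) breaks the current window.

```c
#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define K 8
#define NUM_KMERS (1u << (2 * K)) /* 4^8 = 65536 possible octamers */

/* Map a base to 2 bits; return -1 for anything else (N, IUPAC codes, ...). */
static int base_code(int c) {
    switch (toupper(c)) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

/* Encode an 8-base motif as a table index; return -1 if it is not 8 bases of ACGT. */
static long encode_octamer(const char *motif) {
    unsigned code = 0;
    if (strlen(motif) != K) return -1;
    for (int i = 0; i < K; i++) {
        int b = base_code((unsigned char)motif[i]);
        if (b < 0) return -1;
        code = (code << 2) | (unsigned)b;
    }
    return (long)code;
}

/* Count every (overlapping) octamer of a FASTA stream into counts[NUM_KMERS]. */
static void count_octamers(FILE *fasta, unsigned long *counts) {
    unsigned code = 0;  /* rolling 2-bit encoding of the most recent bases */
    int valid = 0;      /* consecutive valid bases seen so far, capped at K */
    int at_line_start = 1, in_header = 0;
    int c;

    memset(counts, 0, NUM_KMERS * sizeof *counts);
    while ((c = fgetc(fasta)) != EOF) {
        if (c == '\n') { at_line_start = 1; in_header = 0; continue; }
        if (at_line_start && c == '>') { in_header = 1; valid = 0; } /* new record */
        at_line_start = 0;
        if (in_header || c == '\r') continue;

        int b = base_code(c);
        if (b < 0) { valid = 0; continue; } /* ambiguous base: restart the window */

        code = ((code << 2) | (unsigned)b) & (NUM_KMERS - 1); /* keep the last 8 bases */
        if (valid < K) valid++;
        if (valid == K) counts[code]++;
    }
}
```

After calling `count_octamers` once, the count for a motif is `counts[encode_octamer("ATTCGTGT")]` (check that the index is not -1 first), so looking up all 1000 motifs costs a single pass over the genome. Only the strand as written in the file is counted.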
Hope this helps and that I am not too far from the main point,
Esteban
Woah, that's a really cool technique I've never seen before. You don't store more than a byte at a time or do any copying when comparing to the barcode. Awesome to see more C programmers on the forum :)
Unfortunately, the lack of being able to support overlapping sequences could be a big big issue depending on the barcode. For example, this code wouldn't find the barcode "AAC" in the genome "....AAAC....", no matter what came before or after in the genome. This makes it unsuitable for this sort of application -- however, I learnt something new, so still awesome :)
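To make the overlap point concrete, here is a minimal sketch (an illustration only, assuming the sequence is already in memory as a string) that checks the motif at every start position, so a partial match is never thrown away. The function name `count_overlapping` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of motif in seq, allowing matches to overlap. */
static size_t count_overlapping(const char *seq, const char *motif) {
    size_t n = strlen(seq), m = strlen(motif), count = 0;
    if (m == 0 || m > n) return 0;
    /* Try every start position i where a full motif still fits. */
    for (size_t i = 0; i + m <= n; i++) {
        if (memcmp(seq + i, motif, m) == 0) count++;
    }
    return count;
}
```

With this, `count_overlapping("AAAC", "AAC")` finds the one occurrence that the streaming version misses. The price is comparing up to m characters per position, which is cheap for motifs as short as 8 bases.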
Thank you for your feedback! Yes, it's more of an illustrative code and would probably not be used in real applications (at least, not as it is). I thought it could be helpful for the original poster, and if he needed it he could develop it further or ask for more help.
compseq from EMBOSS can give you the number of times every possible octamer appears in a sequence. It can be run via Galaxy. Make sure to set all frames in the parameters.
Thank you very much to all of you, your responses are very helpful.