pattern matching tools
8.6 years ago
jaqy ▴ 20

Hello, I have 1000 motifs (octamers) and I want to determine the total number of occurrences of each motif in the whole genome of Arabidopsis thaliana. Can you help me, please? Do you know a tool to do this, or what else can I do? I have tried the RSAT tools, but the run takes a lot of time (I submitted my job 2 days ago and have not received the results yet). Thank you very much.

sequence genome • 2.2k views


8.6 years ago
estebanpw ▴ 30

I am no biologist, but I suppose octamer motifs are just 8 particular bases (e.g. ATTCGTGT). You can download the whole genome (it's about 135 Mb) and then run a simple program that counts the number of motifs matching yours (this is like counting k-mers for k=8).

You could do it like this in C:

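#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void){

    const char myOctamer[] = "ATTCGTGT"; // Your octamer here
    const int motifLength = (int)strlen(myOctamer);

    FILE * genomeFasta = fopen("file.txt", "r"); // Genome sequence file
    if(genomeFasta == NULL){
        perror("file.txt");
        return EXIT_FAILURE;
    }

    int c; // int rather than char, so EOF can be told apart from data
    int totalFound = 0;
    int current = 0; // Number of motif characters matched so far

    // Test the return value of fgetc instead of feof, which only
    // becomes true after a read has already failed
    while((c = fgetc(genomeFasta)) != EOF){

        if(c != '\n'){ // Ignore line breaks inside the sequence
            if(c == myOctamer[current]){
                current++;
                if(current == motifLength){
                    current = 0;
                    totalFound++;
                }
            }else{
                // Restart from scratch; the current character is not
                // re-checked, so overlapping matches can be missed
                current = 0;
            }
        }
    }
    fclose(genomeFasta);
    fprintf(stdout, "Found %d occurrences.\n", totalFound);
    return 0;
}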


This works for only one octamer, but generalizing it to n octamers would not be difficult. Note that it will not work for overlapping matches. And in case you actually use it, I recommend compiling with -D_FILE_OFFSET_BITS=64 to be able to handle large files (over 2 GB).
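One possible way to generalize this to n octamers, sketched here as an illustration rather than as the poster's own method: encode each base in 2 bits, keep a rolling window of the last 8 bases, and increment one of 4^8 = 65,536 counters for every window. All octamers are then counted in a single pass, overlapping ones included, and each of the 1000 motifs becomes a table lookup. The names `base_code`, `encode_octamer` and `count_octamers` are hypothetical, and the sketch assumes a plain FASTA file, counts the forward strand only, and treats any non-ACGT character (such as N) as a break in the window.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K 8                        /* motif length (octamers) */
#define TABLE_SIZE (1u << (2 * K)) /* 4^8 = 65,536 possible octamers */

/* Map a base to 2 bits; return -1 for anything else (N, gaps, ...). */
static int base_code(int c){
    switch(c){
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

/* Encode an octamer as a table index, or return -1 if it is invalid. */
static long encode_octamer(const char *motif){
    long code = 0;
    if(strlen(motif) != K) return -1;
    for(int i = 0; i < K; i++){
        int b = base_code((unsigned char)motif[i]);
        if(b < 0) return -1;
        code = (code << 2) | b;
    }
    return code;
}

/* Count every octamer in a FASTA stream, overlapping ones included.
   counts must hold TABLE_SIZE entries, all zero on entry. */
static void count_octamers(FILE *fasta, unsigned long *counts){
    unsigned long window = 0; /* 2-bit codes of the most recent bases */
    int valid = 0;            /* consecutive valid bases seen, capped at K */
    int inHeader = 0;         /* currently inside a '>' header line? */
    int atLineStart = 1;
    int c;
    while((c = fgetc(fasta)) != EOF){
        if(c == '\n'){ inHeader = 0; atLineStart = 1; continue; }
        if(atLineStart && c == '>'){ inHeader = 1; valid = 0; } /* new record */
        atLineStart = 0;
        if(inHeader || c == '\r') continue;
        int b = base_code(c);
        if(b < 0){ valid = 0; continue; } /* e.g. N breaks the window */
        window = ((window << 2) | (unsigned long)b) & (TABLE_SIZE - 1);
        if(valid < K) valid++;
        if(valid == K) counts[window]++;
    }
}
```

A caller would allocate the table with `calloc(TABLE_SIZE, sizeof(unsigned long))`, run `count_octamers` once on the genome file, and then read `counts[encode_octamer(motif)]` for each motif, after checking that `encode_octamer` did not return -1.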

Hope this helps and that I am not too far from the main point, Esteban


Whoa, that's a really cool technique I've never seen before. You don't store more than a byte at a time or do any copying when comparing to the barcode. Awesome to see more C programmers on the forum :)

Unfortunately, not being able to support overlapping sequences could be a big, big issue depending on the barcode. For example, this code wouldn't find the barcode "AAC" in the genome "....AAAC...." when the base just before the run of A's is not itself an A: the third A resets the match, and the C is then compared against the first letter of the barcode. This makes it unsuitable for this sort of application -- however, I learnt something new, so still awesome :)
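For comparison, a minimal sketch of a count that does handle overlaps: test the barcode at every start position instead of resetting after a mismatch. The function name `count_overlapping` is hypothetical, and the sketch assumes the sequence is already in memory as a single string with no line breaks.

```c
#include <stddef.h>
#include <string.h>

/* Count occurrences of motif in seq, including overlapping ones,
   by comparing the motif against every possible start position. */
static size_t count_overlapping(const char *seq, const char *motif){
    size_t n = strlen(seq), m = strlen(motif), found = 0;
    if(m == 0 || m > n) return 0;
    for(size_t i = 0; i + m <= n; i++){
        if(memcmp(seq + i, motif, m) == 0) found++;
    }
    return found;
}
```

With this version, `count_overlapping("AAAC", "AAC")` finds the one occurrence that the streaming code misses. The cost is up to m comparisons per position, which is negligible for m = 8.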


Thank you for your feedback! Yes, it's more of an illustrative code and would probably not be used in real applications (at least, not as it is). I thought it could be helpful for the original poster, and in case he needed it, he could develop it further or ask for more help.

Likewise, awesome to see more C programmers!

8.6 years ago

A k-mer counter like khmer or Jellyfish can be used to obtain all octamer frequencies, which can then be filtered for the subset of interest.

8.6 years ago
Asaf 10k

compseq from EMBOSS can give you the number of times every possible octamer appears in a sequence. It can be run via Galaxy. Make sure to set all frames in the parameters.


Thanks, it's very simple and fast.

8.6 years ago
jaqy ▴ 20

Thank you very much to all of you; your responses are very helpful.
