I am currently working on a bioinformatics problem where I need to lookup and count the location and count of occurences of 4000-ish 5 character long patterns in each sequence of a fasta file of 700GB. This fasta file has aligned sequences of one virus. To start with, I want to just look up 6 patterns.
The characters are limited to ATGC for the patterns and the fasta.
The input to the code is just the patterns I want to look up and the aligned sequences of the virus.
The output should be a file which is of a matrix format with #rows = num of patterns, # columns = #num of characters in any given fasta sequence (should be same for all sequences in the fasta as it is an aligned file). Each value of the matrix should be the % of sequences that have that pattern starting in that coordinate.
For example- input:
seq1 - ATGCCAT
seq2 - ATG-GAT
seq3 - GGAATTC
seq4 - -ATATCC
pattern #1 - AT
pattern #2 - GC
NUMBER OF SEQUENCES = 4
-
= any char which has not been sequenced/ not in ATGC total number of sequences at any particular coordinate should be = number of seq - num of invalid sequences at that coordinate
output matrix:
Pattern 1 2 3 4 5 6
AT 2/3 1/4 0 2/3 0 2/4
GC 0 0 1/3 0 0 0
let me explain the value 2/3 in position 1*1 of the matrix - in the first coordinate of 2 out of the 4 sequences, the pattern AT starts. 1 sequence has an invalid - character and hence will be excluded from the total number of considered sequences.
I am a newbie to bioinformatics applications and hence am unaare if this is common problem but I am yet to discover a straghtforard solution through my research so far. I recently came across the aho-corasick algortihm for multi-pattern matching with strings through large text files as a very memory efficient and quick means of solving such issues. I am interested in scaling this search operation to many more patterns in the future as well. I am looking for advice on the following:
- Are there any well- established tool for such and operation already? if so, what are they?
- Am I right in attempting to use aho-corasick for this application?
- How can I efficiently apply this algorithm to a largned fasta with aligned sequenced? can I read the fasta in chunks to make it quicker?
- Should I code this in something other than Python? (I have been using Python so far but I am open to others as well).
Open to any suggestions as I am new to big data processing as well
Thanks pierre, does this accept fasta file formats and will it be able to handle large files.