Question

How to find if a string contains non overlapping block of repeated units? Python

0

Entering edit mode

9.2 years ago

IrK ▴ 100

Hi guys,

I am wondering if anyone could advice the package /strategy any hint to the following problem. I have a list of strings, which are of different length ['blabalaa..11', 'anotherstring......60', 'anotherlength....25' and et c] I have to take one string and look if it has any sequentially repeated blocks, for example String1 (see below) contains "kilimanjaro" repeated block. However, in another string it can be something else or it might NOT contain any repetitions at all.

String1=’njarokilimanjarokilimanjarokilimanjarokilimanjarokili’ String1=’njarokilimanjarokilimanjarokilimanjaro*kilimanjarokili’

I want to split the string into blocks of repetitions it contains of, desired output:

’njaro kilimanjaro kilimanjaro kilimanjaro kilimanjaro kili’

' no repetitions are found' and etc

As you can see the beginning and end of the string can contain part of the required information. So basically repeated blocks are hidden somewhere in a middle of a string,

I dont know the pattern of this repetitions, for each string in my list it is different,
I dont know the length of this repetitions,
I dont know from which position the repetitions starts

Is it possible to solve such problem?

Thank you in advance

python repeated units string • 6.9k views

ADD COMMENT • link updated 9.2 years ago by Alex Reynolds 36k • written 9.2 years ago by IrK ▴ 100

1

Entering edit mode

I assume you want the largest substring which is repeated? "njarokilimanjarokilimanjarokilimanjarokilimanjarokili" could also be: "njarokili man jarokili man jarokili man jarokili manjaro kili" if we found either man or njarokili.

I would suggest starting with a sliding window approach per string, brute force searching starting from the largest substring:

Some badly indented sample code, untested but as a pointer to what you could optimize:

for i in range(length(string))
    lengthsubstr = (length(string) - i)
    substrslist = [string[0:lengthsubstr+1], [1:]] #Python range goes to end exclusively.
    for substr in substrslist:
        count = string.count(substr)
         if count > 1:
#Found a recurring pattern because count > 1. Perhaps you want to be more stringent and count has to be > n
            res = substr          
            break
#If no count > 1 found, continue loop with 3 substrings of length string - 2

Let me know if you can't get this working and I'll have another look at it later ;)

ADD REPLY • link 9.2 years ago by WouterDeCoster 48k

score 2 · Answer 1 · 2016-03-24

2

Entering edit mode

9.2 years ago

Michael 55k

In the bioinformatics context, these are called tandem repeats, what you describe is a classical bioinformatics problem and is solved with string algorithms like TandemRepeatFinder https://tandem.bu.edu/trf/trf.html

ADD COMMENT • link 9.2 years ago by Michael 55k

0

Entering edit mode

sorry to get back to my post so late.

I have checked the TRF, which works Ok with a small size input, however, when it comes to really big data file like 200MB, I haven't gotten any result, even if I partition my file into smaller pieces I also checked Mreps no file output possible., T-reks -output looks good to me, but again it is collapsing with big size files .Personally, I prefer the TRF output, but I have to find a solution for the MB file size. Although at they website they say: 'analyzing sequences on the order of .5Mb in just a few seconds'

ADD REPLY • link 9.2 years ago by IrK ▴ 100

0

Entering edit mode

I ran our 700 MB genome fasta through TRF with no problems. How did you run it and on what type of machine?

ADD REPLY • link 9.2 years ago by Michael 55k

0

Entering edit mode

I was using a web interface on windows machine. However, I tried to download the copy of the program on Linux 64 bits, but couldn't make it work (I just downloaded binary file, which looks like trf409.linux64, and then just right click -> permission -> allow executing file as program, and run it as suggested on their website, please correct me if I have done the installation incorrectly ).

also how long doest it take to run 700MB (approximately) ? Thanks

ADD REPLY • link 9.2 years ago by IrK ▴ 100

0

Entering edit mode

I am just running it again, because I couldn't remember. Running it with standard parameters from https://tandem.bu.edu/trf/trf.unix.help.html and -h to not produce html ouput. My estimate from intermediate output is that it will take about 3 - 5 days to complete. trf can only use 1 cpu at a time, but it could be run via GNU parallel to split the input into chromosomes and finish faster.

ADD REPLY • link 9.2 years ago by Michael 55k

0

Entering edit mode

I managed to run the command from Linux (just needed to add .\ in front of file name -> .\trf). And it works and it is very fast, I needed 15 minutes for 300MB of data, with -h option.

once again thank you for suggesting the tool.

ADD REPLY • link 9.2 years ago by IrK ▴ 100

score 1 · Answer 2 · 2016-03-24

I think the following pseudocode would work:

Given string S of length N, calculate n(n+1)/2 substrings of length 1, 2, 3 and so on up to N.
For each substring, slide over S until the first match.
Store the substring and an array of counters and start positions in a hash table, initialized with the first match at start position i, e.g.:

{ substring_a : [ { counter : 1 , start_position : i } ] }
Slide over S. If there is a partial prefix match before the length of the substring past start position i, then remove this substring from consideration.
If there is a full match at the position after the length of the substring, increment the counter. Else, keep sliding over S until another match is found. If a match is found, add another counter and start position j to the array for that substring:

{ substring_a : [ { counter : 1 , start_position : i }, { counter : 1 , start_position : j } ] }
Repeat step 4 for the given substring: slide over S and test. If there is a partial match before the length of the substring, remove from consideration. Else, if there is a full match after the length of the substring, increment the last counter.

Repeat until there are no more characters in S to query.

What you get is a hash table of substrings and, for each substring, an array of matches that store the number of full, non-overlapping repeats at a given start position. You would search through this table for the longest substring with the largest counter value. You could modify step 5 to store the length of the repetition as soon as a mismatch is found, which I'll leave out.