Hi guys, i have a code looking like this:
import HTSeq
reference = open('genome.fa','r')
sequences = dict( s.name, s) for s in HTSeq.FastaReader(reference))
out = open('/home/istolarek/moleculoNA12878/homopolymers_in_ref','w')
def find_all(a_str,sub):
start = 0
while True:
start = a_str.find(sub, start)
if start == -1: return
yield start
start += len(sub)
homa = 'AAAAAAAAAA'
homc = 'CCCCCCCCCC'
homg = 'GGGGGGGGGG'
homt = 'TTTTTTTTTT'
for key,line in sequences.items():
seq = str(line)
a= list(find_all(seq,homa))
c = list(find_all(seq,homc))
g = list(find_all(seq,homg))
t = list(find_all(seq,homt))
for i in a:
## print i,key,'A'
out.write(str(i)+'\t'+str(key)+'\t'+'A'+'\n')
for i in c:
out.write(str(i)+'\t'+str(key)+'\t'+'C'+'\n')
## print i,key,'C'
for i in g:
out.write(str(i)+'\t'+str(key)+'\t'+'G'+'\n')
for i in t:
out.write(str(i)+'\t'+str(key)+'\t'+'T'+'\n')
out.close()
I used HTSeq to open the reference. What it does - it looks for simple homopolymers of length 10 and outputs start position, chromosome and type (A,C,T,G,). Now what I would like to change is finding the length of each homopolymer. Since, right now if a homopolymer has a length 15, my script can't tell it. I was thinking of doing it by some sort of regex, namely: find at least 10 bp like now and compute length by adding +1 to it until the next base isn't like those in homopolymer.
Any suggestions how to use a regex to do it in python?
Not sure this is precisely what was asked for. The regex for "at least 10 A" is A{10,}.
The search here needs to be greedy,
A{4,10}
will find all substrings matching that pattern, so longer stretches will generate many hits:A{10,}
works with a few quick tests. OP one issue with using a regex here is that you lose any information about the position of the homopolymers. You can easily generalize the search:The above code can be adapted to fill a dictionary if you really wanted, but I generally find it easier to enforce a fixed order for my alphabets if all I am doing is counting and so on. There is a downside to using this method: the memory usage could potentially be massive.
re.findall
returns a list of substrings matching the provided pattern, when all you're doing is looking for the length of each one. This means a potentially massive list will be generated each time. You would be better off iterating through the string in one shot, counting for all characters as you go. A regex can be handy and quick to write, but there are cases where the few moments less you spent writing the code can result in much worse performance than if you had written some loops.