Question

Dna Pattern Matching In Python for Analysis

2

Entering edit mode

6.2 years ago

drenz ▴ 30

Hi Guys, I am new to the forum to come directly with a question / problem.

I want to write a program that counts the patterns like this. I have given such sequences as an example:

A: AAAAAAAAA

B: AAAAACAAG

C: GAAAACGAA

D: AAAAAAAAT

A patten is a group of four letters of the four groups. As an example, the first Patten would be "AAGA". Now I wonder how many times that happens in my entire Sequenze. So far I have only found programs that are looking for a given pattern. Does anyone have an idea how to program this or do you already know where this was done?

I'm very thankful for your help!

Best regards

sequence RNA-Seq • 6.0k views

ADD COMMENT • link updated 6.2 years ago by Kevin Blighe 88k • written 6.2 years ago by drenz ▴ 30

0

Entering edit mode

Is this a homework assignment? What have you tried?

ADD REPLY • link 6.2 years ago by Benn 8.3k

0

Entering edit mode

No, this is not a homework assignment. I'm getting into bioinformatics and the problem has been with me many times. I have already tried the following approach:

enter link description here

But I did not even get an idea. That's why I'm asking here if anyone has an idea.

ADD REPLY • link 6.2 years ago by drenz ▴ 30

0

Entering edit mode

Many thanks for your help! I'm watching this. I still wonder if the kmer-counter is the other way around. A pattern should be a group of four letters. Each of these four letters is from the group A, B, C, D. So the first letter from the first group is A, from the second group A, from the third group G and from the fourth group A. Together, these 4 letters are then a patter. Do you understand what I mean?

ADD REPLY • link 6.2 years ago by drenz ▴ 30

0

Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

0

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

score 6 · Accepted Answer · 2018-09-12

6

Entering edit mode

6.2 years ago

sacha ★ 2.4k

Something like this ?

Input fasta file : test.fa

>A
AAAAAAAAA
>B
AAAAACAAG
>C
GAAAACGAA
>D
AAAAAAAAT

Python counter

from Bio import SeqIO
import pandas as pd
from collections import Counter

df = pd.DataFrame()
for record in SeqIO.parse("test.fa", "fasta"):
    df[record.id] = list(str(record.seq))
df = df.transpose()

It returns a dataframe like this :

    0   1   2   3   4   5   6   7   8
A   A   A   A   A   A   A   A   A   A
B   A   A   A   A   A   C   A   A   G
C   G   A   A   A   A   C   G   A   A
D   A   A   A   A   A   A   A   A   T

Then you can count column string occurence with Counter

c = Counter([df[i].str.cat() for i in df])
# Output : Counter({'AAGA': 2, 'AAAA': 5, 'ACCA': 1, 'AGAT': 1})

Note : It only works if you have few sequence. Loading the dataframe with a huge fasta will fail with low memory. You also may want to see Position Weight Matrix

ADD COMMENT • link 6.2 years ago by sacha ★ 2.4k

0

Entering edit mode

Thank you so much! I think I got it now.

ADD REPLY • link 6.2 years ago by drenz ▴ 30

1

Entering edit mode

Is the previous code the expected result? So, if you have a large dataset, you can avoid creation of matrix. This is not efficient, but will work :

patterns = []
seq1 = str(next(SeqIO.parse("test.fa", "fasta")).seq)

for i in range(len(seq1)):
    pattern = list()
    for record in SeqIO.parse("test.fa", "fasta"): 
        pattern.append(record.seq[i])
    patterns.append("".join(pattern))

Counter(patterns)

ADD REPLY • link 6.2 years ago by sacha ★ 2.4k

1

Entering edit mode

Yes, that absolutely meets my expectations! I did not know the position weight matrix before. But I study biomathematics, so that was not a problem now. The advantage of this is also that I can directly spend the probabilities. Thank you, that was really useful and even worked!

ADD REPLY • link 6.2 years ago by drenz ▴ 30

score 4 · Accepted Answer · 2018-09-12

What you are looking for is 'kmer frequency'. There are lots of pre-existing tools to calculate this.

Our very own Alex Reynolds has this github repository for instance:

https://github.com/alexpreynolds/kmer-counter

Kmer frequency will give you the number of occurrences of ALL strings of k letters long. If you're only interested in particular kmer occurences, you can filter the data you get back. You may even be able to specify which kmers you want counted specifically.

The above is not a python solution, but there are plenty out there that are which you should be able to find easily now you have that key terminology.

Here's a tutorial you could follow if you wanted to code something yourself though.

score 4 · Accepted Answer · 2018-09-12

You could use a Python regular expression with a lookahead for your kmer of interest:

>>> import re
>>> a = 'AAAAACAAGAAA'
>>> [x.start() for x in re.finditer('(?=AAA)', a)]
[0, 1, 2, 9]
>>> [x.start() for x in re.finditer('(?=AAZ)', a)]
[]

The len() property of the list of "hit" start positions is the number or frequency of the kmer in question.

Consider also what you might need to do in Python, if you need to search the reverse complement of the query sequence.

Another possible approach would be to write a Python generator of all contiguous substrings of a query sequence (and, possibly, its reverse complement). Using a generator would reduce the memory overhead of making a list from all substrings.

One can iterate or walk through the output of a generator, one element at a time. You might compare the generator output with your kmer of interest and increment a counter, if there's a match.