Question

Most Frequent Nucleotide/String Pattern In Dna Sequence Files Using Python Or Biopython

4

Entering edit mode

13.4 years ago

Dhillonv10 ▴ 110

Hi all,

So I'm working on this problem of finding the most frequent 6-nucleotide long patterns from a given DNA file which is a standard FASTA file looking like:

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides> 

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides>

Now I have these sequences in a file and I have a method for counting the most frequent ones, namely taking each sequence and copy-paste it in a script that uses the Counter function from python library:

>>> from collections import Counter
>>> protein = "ACGTACTAGACAGAGAGAGAG"
>>> Counter(protein[i:i+6] for i in range(len(protein)-5))
Counter({'AGAGAG': 3, 'GAGAGA': 2, 'ACGTAC': 1, 'CGTACT': 1, 'ACAGAG': 1, 'AGACAG': 1, 'TACTAG': 1, 'TAGACA': 1, 'CTAGAC': 1, 'CAGAGA': 1, 'GTACTA': 1, 'ACTAGA': 1, 'GACAGA': 1})

I modified the snipper a little so now it asks me for the sequence and I just copy-past it in. Biopython has a FASTA parser: (Source: the biopython cookbook)

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

But using that and then doing Counter on seq_record.seq just makes things ugly. My question is that can anyone tell me how to modify that little snippet of code that I have so that it can read from a file and do that for each sequence present? TO be more precise, how do I change that biopython snipper so that Counter function gives me the same result as the Counter used above when I hand-input the sequence. Thanks!

biopython python fasta • 11k views

ADD COMMENT • link updated 13.4 years ago by brentp 24k • written 13.4 years ago by Dhillonv10 ▴ 110

Ram · Answer 1 · 2011-11-29

5

Entering edit mode

13.4 years ago

Damian Kao 16k

Seems like you are almost there. You just have to combined the two pieces of code like this:

from Bio import SeqIO
from collections import Counter

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    seq = str(seq_record.seq)
    cnt = Counter(seq[i:i+6] for i in range(len(seq)-5))

    #you can now do what you want with this counter object. 
    #you can turn it in to a dictionary by dict(cnt)
    #cnt.items() to convert to list of pairs (elements,count)
    #
    #for example:
    #
    #print ">" + seq_record.id
    #for subsequence, count in cnt.items():
    #    print subSequence + '\t' + str(count)

You probably do not have to cast seq_record.seq as a string, but sometimes it can be a problem if you don't. Read more about the python counter object here.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 13.4 years ago by Damian Kao 16k

0

Entering edit mode

thanks @DK, I've tried somerhing like that, but here's the problem using COunter on the seq object gives me something like:

Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.4 years ago by Dhillonv10 ▴ 110

0

Entering edit mode

try casting seq object as string like in the script I wrote.

ADD REPLY • link 13.4 years ago by Damian Kao 16k

0

Entering edit mode

@DK: you ROCK!!!! thanks a lot, wow i've been trying to work this out for a long time and it finally worked out.

ADD REPLY • link 13.4 years ago by Dhillonv10 ▴ 110

Ram · Answer 2 · 2011-11-29

3

Entering edit mode

13.4 years ago

brentp 24k

If I understand correctly, this should get you close:

from collections import Counter
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    seq = seq_record.seq
    c = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    print c

ADD COMMENT • link 13.4 years ago by brentp 24k

0

Entering edit mode

Thanks @brentp, I have the same problem with this snippet as the one above it, I get this:

Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 13.4 years ago by Dhillonv10 ▴ 110

0

Entering edit mode

add seq = str(seq_record.seq) as DK suggets

ADD REPLY • link 13.4 years ago by brentp 24k

0

Entering edit mode

@brentp, it turns out if you cast a string at seq_record.seq, it works out perfectly. So if you could edit your answer here: seq = (str) seq_record.seq