Most Frequent Nucleotide/String Pattern In Dna Sequence Files Using Python Or Biopython
2
4
Entering edit mode
13.0 years ago
Dhillonv10 ▴ 110

Hi all,

So I'm working on this problem of finding the most frequent 6-nucleotide long patterns from a given DNA file which is a standard FASTA file looking like:

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides> 

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides>

Now I have these sequences in a file and I have a method for counting the most frequent ones, namely taking each sequence and copy-paste it in a script that uses the Counter function from python library:

>>> from collections import Counter
>>> protein = "ACGTACTAGACAGAGAGAGAG"
>>> Counter(protein[i:i+6] for i in range(len(protein)-5))
Counter({'AGAGAG': 3, 'GAGAGA': 2, 'ACGTAC': 1, 'CGTACT': 1, 'ACAGAG': 1, 'AGACAG': 1, 'TACTAG': 1, 'TAGACA': 1, 'CTAGAC': 1, 'CAGAGA': 1, 'GTACTA': 1, 'ACTAGA': 1, 'GACAGA': 1})

I modified the snipper a little so now it asks me for the sequence and I just copy-past it in. Biopython has a FASTA parser: (Source: the biopython cookbook)

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

But using that and then doing Counter on seq_record.seq just makes things ugly. My question is that can anyone tell me how to modify that little snippet of code that I have so that it can read from a file and do that for each sequence present? TO be more precise, how do I change that biopython snipper so that Counter function gives me the same result as the Counter used above when I hand-input the sequence. Thanks!

biopython python fasta • 10k views
ADD COMMENT
5
Entering edit mode
13.0 years ago

Seems like you are almost there. You just have to combined the two pieces of code like this:

from Bio import SeqIO
from collections import Counter

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    seq = str(seq_record.seq)
    cnt = Counter(seq[i:i+6] for i in range(len(seq)-5))

    #you can now do what you want with this counter object. 
    #you can turn it in to a dictionary by dict(cnt)
    #cnt.items() to convert to list of pairs (elements,count)
    #
    #for example:
    #
    #print ">" + seq_record.id
    #for subsequence, count in cnt.items():
    #    print subSequence + '\t' + str(count)

You probably do not have to cast seq_record.seq as a string, but sometimes it can be a problem if you don't. Read more about the python counter object here.

ADD COMMENT
0
Entering edit mode

thanks @DK, I've tried somerhing like that, but here's the problem using COunter on the seq object gives me something like:

Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',
ADD REPLY
0
Entering edit mode

try casting seq object as string like in the script I wrote.

ADD REPLY
0
Entering edit mode

@DK: you ROCK!!!! thanks a lot, wow i've been trying to work this out for a long time and it finally worked out.

ADD REPLY
3
Entering edit mode
13.0 years ago
brentp 24k

If I understand correctly, this should get you close:

from collections import Counter
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    seq = seq_record.seq
    c = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    print c
ADD COMMENT
0
Entering edit mode

Thanks @brentp, I have the same problem with this snippet as the one above it, I get this:

Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',
ADD REPLY
0
Entering edit mode

add seq = str(seq_record.seq) as DK suggets

ADD REPLY
0
Entering edit mode

@brentp, it turns out if you cast a string at seq_record.seq, it works out perfectly. So if you could edit your answer here: seq = (str) seq_record.seq

ADD REPLY
0
Entering edit mode

yup it worked as needed.

ADD REPLY

Login before adding your answer.

Traffic: 1492 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6