Hi all,
I'm working on the problem of finding the most frequent 6-nucleotide-long patterns (6-mers) in a given DNA file, which is a standard FASTA file that looks like this:
>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides>
>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides>
Now I have these sequences in a file, and I have a method for counting the most frequent patterns: I take each sequence and paste it into a script that uses Counter from Python's collections module:
>>> from collections import Counter
>>> protein = "ACGTACTAGACAGAGAGAGAG"
>>> Counter(protein[i:i+6] for i in range(len(protein)-5))
Counter({'AGAGAG': 3, 'GAGAGA': 2, 'ACGTAC': 1, 'CGTACT': 1, 'ACAGAG': 1, 'AGACAG': 1, 'TACTAG': 1, 'TAGACA': 1, 'CTAGAC': 1, 'CAGAGA': 1, 'GTACTA': 1, 'ACTAGA': 1, 'GACAGA': 1})
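Since the goal is the most frequent patterns rather than the full table, a minimal sketch using Counter.most_common may be a handy follow-up to the session above. The variable names counts and top are just illustrative choices, not part of the original script:

```python
from collections import Counter

protein = "ACGTACTAGACAGAGAGAGAG"

# Count every overlapping 6-mer, exactly as in the session above.
counts = Counter(protein[i:i + 6] for i in range(len(protein) - 5))

# most_common(n) returns the n highest-count (pattern, count) pairs.
top = counts.most_common(2)
print(top)  # [('AGAGAG', 3), ('GAGAGA', 2)]
```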
I modified the snippet a little so that it now asks me for the sequence and I just copy-paste it in. Biopython has a FASTA parser (source: the Biopython cookbook):
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
But using that and then calling Counter on seq_record.seq just makes things ugly. Can anyone tell me how to modify my little snippet so that it reads from a file and does the count for each sequence present? To be more precise, how do I change that Biopython snippet so that Counter gives me the same result as it does above when I input the sequence by hand? Thanks!
Thanks @DK, I've tried something like that, but here's the problem: using Counter on the Seq object gives me something like:
Try casting the Seq object to a string, e.g. str(seq_record.seq), like in the script I wrote.
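For anyone reading along, here is a rough sketch of what the string cast could look like when combined with the cookbook loop. This is not the script referred to above, only one possible arrangement; the function names count_kmers and kmer_counts_per_record are made up for illustration, and the second function assumes Biopython is installed:

```python
from collections import Counter


def count_kmers(sequence, k=6):
    """Count every overlapping k-mer in one sequence."""
    # Cast first: this works for a plain str and for a Biopython Seq object.
    text = str(sequence)
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def kmer_counts_per_record(fasta_path, k=6):
    """Yield (record id, Counter of k-mers) for each sequence in a FASTA file."""
    # Imported here so count_kmers can be used without Biopython installed.
    from Bio import SeqIO
    for seq_record in SeqIO.parse(fasta_path, "fasta"):
        yield seq_record.id, count_kmers(seq_record.seq, k)


# Example usage (replace the file name with your own FASTA file):
# for record_id, counts in kmer_counts_per_record("ls_orchid.fasta"):
#     print(record_id, counts.most_common(5))
```

Keeping the counting in its own function means the hand-pasted sequence and the sequences read from the file go through the same code path, so the results should match.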
@DK: you ROCK!!!! Thanks a lot. Wow, I've been trying to work this out for a long time and it finally worked.