Regular expression matching with Python and biopython SeqIO
10.0 years ago
Ian 6.1k

After many years of using Perl, I am starting to learn Python. As an exercise, I want to perform regular expression matching on sequences extracted from a FASTA file, which I parse with Biopython's SeqIO module. In the following code, re.findall fails when it is given seq_record.seq and raises "TypeError: expected string or buffer". However, if seq_record.seq is replaced with a plain string, e.g. 'TTAATT', a match is found.

# biopython
from Bio import SeqIO
# regex library
import re

# file with FASTA sequence
infile = "fasta.fa"

# pattern to search for
iupac = "taat"

# look through each FASTA sequence in the file
for seq_record in SeqIO.parse(infile, "fasta"):
    print "Sequence ID: ", seq_record.id, "; ", len(seq_record), "bp"
    print seq_record.seq

    # scan for IUPAC; re.I makes search case-insensitive
    matches = re.findall( iupac, seq_record.seq, re.I)
    if matches:
        print "Matches = ", len(matches)

Thanks for any guidance!

regular-expression python biopython • 9.0k views

Hey!

How do I get to print the co-ordinates of the match?
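
One possible way to get coordinates is re.finditer, which yields match objects that carry start() and end() positions, whereas re.findall only returns the matched text. The sketch below is a minimal example under a few assumptions: find_match_coordinates is a hypothetical helper name, the test sequence is made up, and a plain string stands in for str(seq_record.seq). Positions are 0-based and the end is exclusive, as with Python slices.

```python
import re

def find_match_coordinates(pattern, sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Hypothetical helper for illustration. `sequence` may be a plain string
    or anything str() can convert, such as a Biopython Seq.
    """
    text = str(sequence)
    # re.I makes the search case-insensitive, as in the original question
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(pattern, text, re.I)]

# Made-up sequence; in the question's loop this would be seq_record.seq
hits = find_match_coordinates("taat", "GGTAATCCttaatAA")
for start, end, matched in hits:
    print("%d-%d %s" % (start, end, matched))
```

Note that re.finditer, like re.findall, reports only non-overlapping matches. Add 1 to the start value if you need 1-based coordinates.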

10.0 years ago
Peter 6.0k

The Biopython Seq object is string-like, but it is not a string, and re.findall expects a real string, hence the TypeError. Replace re.findall( iupac, seq_record.seq, re.I) with re.findall( iupac, str(seq_record.seq), re.I).
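
For completeness, here is a rough sketch of how the fix could sit in the original loop. The function names count_matches and report_fasta_matches are hypothetical, and the Biopython import is placed inside the function only so that the counting part can be tried without Biopython installed. The print calls use a single formatted string so they behave the same under Python 2 and Python 3.

```python
import re

def count_matches(pattern, sequence):
    """Count non-overlapping, case-insensitive matches of pattern.

    str() converts a Biopython Seq (or any string-like object) into
    a real string before it is handed to the re module.
    """
    return len(re.findall(pattern, str(sequence), re.I))

def report_fasta_matches(path, pattern):
    """Print the match count for every record in a FASTA file."""
    # Imported here (an assumption for this sketch) so that count_matches
    # can be used on plain strings even when Biopython is not installed.
    from Bio import SeqIO
    for seq_record in SeqIO.parse(path, "fasta"):
        n = count_matches(pattern, seq_record.seq)
        print("Sequence ID: %s; %d bp; matches = %d"
              % (seq_record.id, len(seq_record), n))

# Plain-string check using the example string from the question
print(count_matches("taat", "TTAATT"))
```

With a FASTA file at hand, the call would be report_fasta_matches("fasta.fa", "taat").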


Thank you! I thought I had already tried that, but it is now working.
