Question

biopython string object

5.3 years ago

bsp017 ▴ 50

Hi all,

I'm trying to incorporate a regular expression command in a biopython script. This produces an error:

AttributeError: 'str' object has no attribute 'id'

What I would like to do is to match a pattern within a Fasta file and replace the matching characters with other characters.

From this:

>BA_03462|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

To this:

>BA|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

Using the re module I can find and replace the pattern with this command:

matches = re.findall(r'_(.....)', str(seq_record))
for m in matches:
    change = str(seq_record), faa_filename.replace('_%s' % m, ' ')

The complete function is here:

def change_string():
    with open('outfile_padded.fasta') as f:
        for seq_record in SeqIO.parse(f, "fasta"):
            seq_record.id = seq_record.description = matches = re.findall(r'_(.....)', str(seq_record))
            for m in matches:
                change = str(seq_record), faa_filename.replace('_%s' % m, ' ')
    SeqIO.write(change, 'string.fasta', "fasta")
change_string()

However, the attribute error arises because biopython wants a string-like object, but re wants a string. I've tried to modify the script but cannot find a way to please both modules.

Does anyone know a solution to this?

Thanks,

James

python --version
Python 3.6.8 :: Anaconda, Inc.
biopython==1.73
Red Hat 4.8.5-36

biopython fasta parse regular expressions • 2.4k views

ADD COMMENT • link updated 5.3 years ago by Joe 21k • written 5.3 years ago by bsp017 ▴ 50

Do you absolutely need to use python? Would it not be easier to just use sed? Also, why not use re.sub(..., count=0)?

ADD REPLY • link 5.3 years ago by Ram 44k

Building on RamRS's comment, why even use Biopython/SeqIO? Can't you just treat your data as a standard text file and blow through it line-by-line, avoiding any overhead from SeqIO.parse() (only really matters if your fasta is large)? I would also use sed for a quick turnaround.

ADD REPLY • link 5.3 years ago by Brice Sarver ★ 3.8k
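For anyone who wants to stay in Python without SeqIO, the line-by-line idea could look roughly like the sketch below. It assumes the part to drop is always an underscore followed by five digits on a header line, as in the example record; the names `LOCUS_NUMBER`, `clean_fasta_lines` and `clean_fasta_file` are made up for illustration.

```python
import re

# Assumption: the unwanted part is "_" plus exactly five digits, e.g. "_03462".
LOCUS_NUMBER = re.compile(r"_\d{5}")


def clean_fasta_lines(lines):
    """Yield lines unchanged, except headers lose their first '_NNNNN' run."""
    for line in lines:
        if line.startswith(">"):
            # count=1 touches only the first match on the header line.
            yield LOCUS_NUMBER.sub("", line, count=1)
        else:
            yield line


def clean_fasta_file(src_path, dst_path):
    """Stream a FASTA file through clean_fasta_lines (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(clean_fasta_lines(src))


example = [
    ">BA_03462|gyrB Brenneria alni strain NCPPB\n",
    "ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT\n",
]
cleaned = list(clean_fasta_lines(example))
```

Because the generator only ever holds one line at a time, this should scale to large files, but it does no validation of the FASTA format at all.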

While it is probably fine to do so in this case, I would contend that the better general advice is to always use a well-trusted parser whenever possible...

ADD REPLY • link 5.3 years ago by Joe 21k

Yes, it would probably be easier to use a sed or awk command. I was trying to keep this part of my pipeline in python so that everything stays in a single script, and I also want to learn more python.

Would the re.sub command avoid using findall and replace?

ADD REPLY • link 5.3 years ago by bsp017 ▴ 50

Finding matches to a regular expression and substituting them is exactly what re.sub does, so it is the first thing that comes to my mind; the substitution here is not complex enough to warrant a find/match followed by a bunch of steps. From a cursory glance at the re documentation (I don't use python), it seems the replacement argument can also be a function, which would address even complicated substitution problems. I see no reason not to use re.sub.

ADD REPLY • link 5.3 years ago by Ram 44k
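To make the re.sub suggestion concrete, here is a small sketch of both forms: a plain string replacement that deletes the match, and a function replacement that rewrites it. The pattern (underscore plus five digits) is taken from the example header; `strip_leading_zeros` is a hypothetical transformation chosen only to show the callable form, not something the question asks for.

```python
import re

# Assumption: the locus number is "_" followed by exactly five digits.
LOCUS_NUMBER = re.compile(r"_(\d{5})")


def strip_leading_zeros(match):
    """Replacement function: receives the match object, returns the new text."""
    return "_" + str(int(match.group(1)))


header = "BA_03462|gyrB Brenneria alni strain NCPPB"

# String replacement: every match is swapped for the empty string.
deleted = LOCUS_NUMBER.sub("", header)

# Function replacement: re calls strip_leading_zeros once per match.
renumbered = LOCUS_NUMBER.sub(strip_leading_zeros, header)
```

Here `deleted` ends up as the header the question asks for, with no separate findall and replace steps.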

score 4 · Accepted Answer · 2019-08-22

There are a couple of problems here, I think.

Firstly, the error you're getting isn't saying what you think it is. It's saying that, somewhere, the attribute id is being accessed on an object that has no such attribute (here a plain str), rather than complaining about an unexpected string as such.

I'm guessing this has something to do with the following line, where there's a lot going on at once, which is kind of asking for trouble:

seq_record.id = seq_record.description = matches = re.findall(r'_(.....)', str(seq_record))
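As a rough illustration of both points, the sketch below uses a hypothetical `Record` class as a stand-in for a SeqRecord (so it runs without Biopython) and a made-up `header_for` function standing in for whatever code reads `record.id` when writing. It shows that the chained assignment binds the list from re.findall to every target, and that handing a plain string to code expecting a record gives this exact message. Whether this is precisely where the posted function fails is an assumption; one plausible route is that `change` is a tuple of strings being passed to SeqIO.write.

```python
import re


class Record:
    """Hypothetical stand-in for a SeqRecord: just an id and a description."""

    def __init__(self, id, description):
        self.id = id
        self.description = description


rec = Record("BA_03462|gyrB", "BA_03462|gyrB Brenneria alni strain NCPPB")

# Chained assignment binds every target to the same object: the list
# returned by re.findall. id and description are no longer strings.
rec.id = rec.description = matches = re.findall(r"_(.....)", "BA_03462|gyrB")


def header_for(record):
    """Build a FASTA header the way a writer might: by reading record.id."""
    return ">" + record.id


# Passing a plain string where a record is expected reproduces the message.
try:
    header_for("BA|gyrB Brenneria alni strain NCPPB")
except AttributeError as exc:
    message = str(exc)
```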

All you really need to do is the following (assuming your fasta formatting never deviates). I've also changed your regex to be a bit more stringent.

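import re
import sys

from Bio import SeqIO

# Match an underscore followed by exactly five digits, e.g. "_03462".
regex = re.compile(r"_\d{5}")

for rec in SeqIO.parse(sys.argv[1], "fasta"):
    match = regex.search(rec.description)
    if match:  # search() returns None when a header lacks the pattern
        rec.description = rec.description.replace(match.group(), "")
    print(">" + rec.description)
    print(str(rec.seq))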

Input:

>BA_03462|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT

Script: python scriptname.py sequences.fasta

Output:

>BA|gyrB Brenneria alni strain NCPPB
ATGTCGAATTCTTATGACTCCTCAAGTATCAAGGTATTGAAAGGGCTGGATGCGGTACGT
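To show what "more stringent" means in practice, this sketch compares the original pattern, which accepts any five characters after an underscore, with the digit-only one. The second header is invented for the comparison and is not from the question's data.

```python
import re

loose = re.compile(r"_(.....)")   # original: "_" plus any five characters
strict = re.compile(r"_\d{5}")    # answer: "_" plus exactly five digits

real_header = "BA_03462|gyrB Brenneria alni strain NCPPB"
# Hypothetical header with a second underscore that is not a locus number.
tricky_header = "BA_03462|gyrB_alpha Brenneria alni"

loose_hits = loose.findall(tricky_header)
strict_hits = strict.findall(tricky_header)

# On the tricky header, the loose pattern would also eat "_alpha".
loose_result = loose.sub("", tricky_header)
strict_result = strict.sub("", tricky_header)
```

On the real header both patterns behave the same; they only diverge once an underscore appears somewhere else in the description.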

Or even more simply, using re.sub:

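import re
import sys

from Bio import SeqIO

# Match an underscore followed by exactly five digits, e.g. "_03462".
regex = re.compile(r"_\d{5}")

for rec in SeqIO.parse(sys.argv[1], "fasta"):
    # sub() returns the description unchanged when nothing matches.
    rec.description = regex.sub("", rec.description)
    print(">" + rec.description)
    print(str(rec.seq))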