Question

Code for splitting long string into Genbank record gives error

5.9 years ago

toth.joe ▴ 60

I am trying to read a long amino acid string into a Biopython GenBank record object so that a GenBank file can be written. Here is a truncated example:

FEATURES             Location/Qualifiers
     CDS             687..3158
                     /gene="AXL2"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"

My code reads a CSV file containing the data I want to put into the GenBank file:

        feature = Feature()
        feature.key = "CDS"
        feature.location = "1..{}".format(len(row['DNA']))
        feature.qualifiers = ["/translation=", "{}".format(row['Seq'])]
        container.features.append(feature)  
        with open(row['FullCloneName'] + '.gb', 'w') as output_file:
            output_file.write(str(container))

However, I get this error:

File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Record.py", line 631, in __str__
if no_space_key in qualifier.key:
AttributeError: 'str' object has no attribute 'key'

Can someone explain how the Feature method from Genbank Record source takes information from the Qualifier method? My input string has no breaks. It is a long 120 character string. How does this method break up the long string to format it for the Genbank file? Do I need to break up the string with a split character ','?

class Feature(object):
    """Hold information about a Feature in the Feature Table of GenBank record.

    Attributes:
     - key - The key name of the feature (ie. source)
     - location - The string specifying the location of the feature.
     - qualifiers - A listing Qualifier objects in the feature.

    """

    def __init__(self):
        """Initialize."""
        self.key = ''
        self.location = ''
        self.qualifiers = []

    def __str__(self):
        """Return feature as a GenBank format string."""
        output = Record.INTERNAL_FEATURE_FORMAT % self.key
        output += _wrapped_genbank(self.location, Record.GB_FEATURE_INDENT,
                                   split_char=',')
        for qualifier in self.qualifiers:
            output += " " * Record.GB_FEATURE_INDENT

            # determine whether we can wrap on spaces
            space_wrap = 1
            for no_space_key in \
                    Bio.GenBank._BaseGenBankConsumer.remove_space_keys:
                # this is line 631 of Record.py, where the traceback points
                if no_space_key in qualifier.key:
                    space_wrap = 0

            output += _wrapped_genbank(qualifier.key + qualifier.value,
                                       Record.GB_FEATURE_INDENT, space_wrap)
        return output


class Qualifier(object):
    """Hold information about a qualifier in a GenBank feature.

    Attributes:
     - key - The key name of the qualifier (ie. /organism=)
     - value - The value of the qualifier ("Dictyostelium discoideum").

    """

    def __init__(self):
        """Initialize."""
        self.key = ''
        self.value = ''

biopython genbank • 2.0k views

5.9 years ago by toth.joe ▴ 60

Might be time to call in the experts @ Peter

5.9 years ago by Joe 22k

score 0 · Answer 1 · 2019-02-23

5.9 years ago

Peter 6.0k

The SeqRecord approach is intended to be more 'high level', with fewer of the file format details exposed directly. The GenBank-specific Record object approach is quite 'low level', with lots of details you have to handle yourself. But the immediate problem is that you need to use a list of Qualifier objects, something like this:

from Bio.GenBank.Record import Record, Feature, Qualifier

record = Record(...)

for row in ...:
    feature = Feature()
    feature.key = "CDS"
    feature.location = "1..{}".format(len(row['DNA']))
    # feature.qualifiers should be a list of Qualifier objects:
    feature.qualifiers = [
        # These values do not need quoting:
        Qualifier("/transl_table=", "1"),
        Qualifier("/codon_start=", "1"),
        # If the value needs double quotes, you must add them:
        Qualifier("/translation=", '"%s"' % row['Seq']),
    ]
    with open(row['FullCloneName'] + '.gb', 'w') as output_file:
        output_file.write(str(record))

Note your example has every CDS starting at base one, but it looks like you are making minimal GenBank files each with only one CDS.
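
On the original question of how a long, unbroken translation string gets split across lines: the Feature code quoted in the question turns off space-based wrapping for keys like /translation= and hands key + value to a wrapping helper. The sketch below is not Biopython's implementation, only a minimal illustration of fixed-width wrapping. The function name `wrap_qualifier` is made up for this example, and the 21-column indent and 79-character line width are assumptions about the layout.

```python
# Illustrative only: a fixed-width wrapper, not Biopython's own _wrapped_genbank.
FEATURE_INDENT = 21  # assumed column where qualifier text starts
LINE_LENGTH = 79     # assumed maximum line width


def wrap_qualifier(key, value, indent=FEATURE_INDENT, width=LINE_LENGTH):
    """Wrap key + value into indented lines no longer than `width`."""
    text = key + value
    chunk = width - indent  # payload characters that fit on each line
    pieces = [text[i:i + chunk] for i in range(0, len(text), chunk)]
    return "\n".join(" " * indent + piece for piece in pieces)


# A made-up protein string with no spaces or commas in it.
protein = "MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF" * 3
wrapped = wrap_qualifier("/translation=", '"%s"' % protein)
print(wrapped)
```

So the input does not need a split character: a value with no spaces can simply be cut at a fixed width, and the pieces rejoin to the original string.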

5.9 years ago by Peter 6.0k

Thanks for clarifying the source code. I entered the format as you suggest but still get an error.

        feature = Feature()
        feature.key = "CDS"
        feature.location = "1..{}".format(len(row['DNA']))
        feature.qualifiers = [
            Qualifier("/translation=", "%s % row['Seq']"),
            ]

File "csv2gb_gbrecord.py", line 84, in main
Qualifier("/translation=", "%s % row['Seq']"),
TypeError: __init__() takes exactly 1 argument (3 given)

I'm still trying to understand the Qualifier class. I tried this approach but still can't get the Qualifier data into the feature object:

        feature = Feature()
        feature.key = "CDS"
        feature.location = "1..{}".format(len(row['DNA']))
        qualifier = Qualifier() 
        Qualifier.key = "/translation="
        Qualifier.value = "{}".format(row['Seq'])
        feature.qualifiers = [qualifier]    
        container.features.append(feature)
        print (feature)
        print (Qualifier.key, Qualifier.value)
        print feature.qualifiers

There is no translation line in the GenBank file written by the script. The Qualifier attributes were read correctly, but were not passed to the Feature object. The dot means "no data" in GenBank files. Here is the terminal output from the print statements:

CDS             1..375  
                 .
  
('/translation=', 'ERNDAYGHFIS')
[<Bio.GenBank.Record.Qualifier object at 0x7f3fb1f6a6d0>]

5.9 years ago by toth.joe ▴ 60

You need Biopython 1.73 or later for this to work:

qualifier = Qualifier("/translation=", '"%s"' % row['Seq'])

As you worked out, on older versions you must use:

qualifier = Qualifier()
qualifier.key = "/translation="
qualifier.value = '"%s"' % row['Seq']
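
A likely reason the earlier attempt printed only "." is that the key and value were assigned to the Qualifier class rather than to the instance placed in feature.qualifiers. The instance already has its own empty key and value from __init__, and those shadow the class attributes. The sketch below uses a stand-in class copied from the listing in the question, so it runs without Biopython. The `render` helper is hypothetical and only mimics the `qualifier.key + qualifier.value` lookup.

```python
class Qualifier(object):
    """Stand-in mirroring the Qualifier class quoted in the question."""

    def __init__(self):
        self.key = ''
        self.value = ''


def render(qualifier):
    """Hypothetical helper: read key and value the way Feature.__str__ does."""
    return qualifier.key + qualifier.value


# The failing attempt: create an instance, then assign to the class.
broken = Qualifier()
Qualifier.key = "/translation="
Qualifier.value = '"ERNDAYGHFIS"'
broken_text = render(broken)  # instance attributes set in __init__ win

# Assigning to the instance instead.
fixed = Qualifier()
fixed.key = "/translation="
fixed.value = '"ERNDAYGHFIS"'
fixed_text = render(fixed)

print(repr(broken_text))
print(repr(fixed_text))
```

Printing Qualifier.key still shows the assigned value in the broken case, which is why the attributes looked correct even though the feature came out empty.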

5.9 years ago by Peter 6.0k

score 0 · Answer 2 · 2019-02-23

Thanks for the help, everyone. I was able to get the GenBank feature writer to work. It wasn't a problem with long strings; rather, I had to pass the Qualifier key and value correctly to the Feature object.

import datetime
import getpass

from Bio.GenBank.Record import Record, Feature, Qualifier

# `row` is one row of the CSV file, used as a dict
container = Record()
container.locus = row['Sample']
container.size = len(row['DNA'])
container.residue_type = "DNA"
container.data_file_division = "PRI"
container.date = datetime.date.today().strftime("%d-%b-%Y")  # today's date
container.definition = row['FullCloneName']
container.accession = [row['Vgene']]
container.comment = 'project xyz'
container.version = getpass.getuser()
container.keywords = [row['ProjectName']]
feature = Feature()
feature.key = "CDS"
feature.location = "1..{}".format(len(row['DNA']))
# Set key and value on a Qualifier instance, not on the Qualifier class,
# so that each feature gets its own qualifier
qualifier = Qualifier()
qualifier.key = "/translation="
qualifier.value = '"{}"'.format(row['Seq'])
feature.qualifiers.append(qualifier)
container.features.append(feature)

# Save as GenBank file
with open(row['ProjectName'] + '_' + row['FullCloneName'] + '.gb', 'w') as output_file:
    output_file.write(str(container))