Biopython: renaming SeqRecords using dictionary values
9.0 years ago
cdarwin • 0

(Disclaimer: self-teaching; knowledge is minimal)

Hello all.

I am trying to change the record IDs for a batch of sequences using some metadata. I have two files. File 1 is a tab-delimited text file that holds the metadata in a simple format:

proteinID    organismID
string1      string2

File 2 is a FASTA file with the proteinID as the leading string after the >:

>proteinID...

What I want to do is rename the sequences in File 2 using the correspondence from File 1, so that each header becomes:

>organismID...

So far, I have created a dictionary from File 1 using the protein IDs as the keys:

id_match_dict = {}
with open('file1.txt') as id_match:
    for line in id_match:
        (key,val) = line.strip("\n").split("\t")
        id_match_dict[str(key)] = val

This has worked well so far. Now I am trying to use this dictionary to modify the ID of the SeqRecord objects using Biopython (record.id). My attempts have been really bad, and I don't even want to post what I have written. Suffice it to say, I am at a loss at this point. Could anyone help me with this, or even point me in the right direction? I have no clue how to approach this problem.

[Please let me know if I need to provide more information, I am trying to keep this brief]

Thank you in advance!

python biopython fasta • 3.6k views
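from copy import deepcopy

from Bio import SeqIO

# Build the old-ID -> new-ID mapping from the tab-delimited file
id_mapper = {}
with open("file1") as handle:
    for each_line in handle:
        if each_line.strip():
            old_id, new_id = each_line.rstrip("\n").split("\t")
            id_mapper[old_id] = new_id

# Rename each FASTA record and write it to a new file
# ("file2_renamed" is just a placeholder output name)
with open("file2") as in_handle, open("file2_renamed", "w") as out_handle:
    for record in SeqIO.parse(in_handle, "fasta"):
        record_mod = deepcopy(record)
        record_mod.id = id_mapper[record.id]
        # The description still starts with the old ID; reset it if unwanted
        SeqIO.write(record_mod, out_handle, "fasta")

Try this code.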

I have a Galaxy tool which would do this nicely for you: https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_rename. It is written in Python, but currently it uses the Galaxy FASTA parser rather than the Biopython one.

9.0 years ago
Peter 6.0k

Something like this, based on your start:

# Load name mapping as a dictionary
id_match_dict = {}
with open('file1.txt') as id_match:
    for line in id_match:
        if line.strip():
            old, new = line.strip("\n").split("\t")
            id_match_dict[old] = new
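
If the table might contain malformed or duplicate lines, a stricter loader could look something like the sketch below. The function name load_id_map is hypothetical (it is not part of Biopython), and it assumes exactly two tab-separated columns per non-blank line:

```python
def load_id_map(lines):
    """Build an {old_id: new_id} dict from tab-delimited lines."""
    id_map = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")  # also copes with Windows line endings
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 tab-separated fields, got %d"
                % (line_number, len(fields))
            )
        old, new = fields
        if old in id_map:
            raise ValueError("line %d: duplicate ID %r" % (line_number, old))
        id_map[old] = new
    return id_map
```

It accepts any iterable of lines, so it can be used as id_match_dict = load_id_map(id_match) inside the same with open('file1.txt') as id_match: block.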

There are many ways to do the next bit; this one uses plain strings and outputs the FASTA file with no line wrapping:

from Bio.SeqIO.FastaIO import SimpleFastaParser

in_filename = "old_names.fas"
out_filename = "new_names.fas"

with open(in_filename) as in_handle:
    with open(out_filename, "w") as out_handle:
        for title, seq in SimpleFastaParser(in_handle):
            name, descr = title.split(None, 1)
            name = id_match_dict[name]
            out_handle.write(">%s %s\n%s\n" % (name, descr, seq))

NOTE: This will give a KeyError if a name not in your table is found. What would you want to happen? Leave the old name as is?
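
If leaving unknown names as they are is acceptable, one possible approach is a small helper built on dict.get, sketched below. The name rename_title is hypothetical; it works on the title string that SimpleFastaParser yields, and it also tolerates titles that have no description after the ID:

```python
def rename_title(title, id_map):
    """Return a FASTA title with its first word renamed when it is in id_map."""
    parts = title.split(None, 1)
    if not parts:
        return title  # empty title, nothing to rename
    name = parts[0]
    descr = parts[1] if len(parts) > 1 else ""
    new_name = id_map.get(name, name)  # fall back to the old name if unknown
    return "%s %s" % (new_name, descr) if descr else new_name
```

Inside the loop above, the write call would then become out_handle.write(">%s\n%s\n" % (rename_title(title, id_match_dict), seq)).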

This is great, Peter! Thank you so much for your help - I really appreciate it.

9.0 years ago

I don't have a Biopython answer, although it should be pretty straightforward. I would suggest using pyfaidx for this:

Excellent! Thank you so much for your response. I have never heard of the pyfaidx module and am glad to discover it.
