(Disclaimer: self-teaching; knowledge is minimal)
Hello all.
I am trying to change the record IDs for a batch of sequences using some metadata. I have two files. File 1 (file1) is the metadata, a tab-delimited text file in a simple format:
proteinID organismID
string1 string2
File 2: a FASTA file with the proteinID as the leading string after the >
>proteinID...
What I want to do is rename the sequences in File 2 using the correspondence from File 1, so that each header becomes:
>organismID...
So far, I have created a dictionary from File 1 using the protein IDs as the keys:
    id_match_dict = {}
    with open('file1.txt') as id_match:
        for line in id_match:
            # Each line holds two tab-separated fields: proteinID, organismID
            key, val = line.rstrip("\n").split("\t")
            id_match_dict[key] = val
This has worked well so far. Now I am trying to use this dictionary to modify the id of the SeqRecord objects using BioPython (record.id). My attempts at this have been really bad, and I don't even want to post what I have written. Suffice it to say, I am at a loss at this point. Could anyone help me with this, or even point me in the right direction? I have no clue how to approach this problem.
[Please let me know if I need to provide more information; I am trying to keep this brief.]
Thank you in advance!
Try this code...
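A minimal sketch of one way to do this with Biopython, not a definitive solution. It assumes File 1 has exactly the two tab-separated columns shown in the question and that the proteinID is the first whitespace-delimited word of each FASTA header. The helper names (load_id_map, rename_records, rename_fasta) and the file names file2.fasta and renamed.fasta are made up for illustration. Records whose ID is not in the dictionary are left unchanged.

```python
import csv


def load_id_map(path):
    """Read a two-column, tab-delimited file into {proteinID: organismID}."""
    id_map = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            id_map[row[0]] = row[1]
    return id_map


def rename_records(records, id_map):
    """Yield each record, swapping record.id when the old id is in id_map."""
    for record in records:
        old_id = record.id
        new_id = id_map.get(old_id)
        if new_id is not None:
            # For FASTA input, record.description starts with the old id.
            # Drop that prefix so the old id is not written after the new one.
            if record.description.startswith(old_id):
                record.description = record.description[len(old_id):].strip()
            record.id = new_id
        yield record


def rename_fasta(in_path, out_path, id_map):
    """Rewrite in_path to out_path with renamed ids; returns the record count."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import SeqIO

    records = SeqIO.parse(in_path, "fasta")
    return SeqIO.write(rename_records(records, id_map), out_path, "fasta")
```

With the files from the question, this would be called as rename_fasta("file2.fasta", "renamed.fasta", load_id_map("file1.txt")). If File 1 starts with a header row, that row simply becomes one extra dictionary entry, which is harmless unless a sequence is literally named proteinID.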
I have a Galaxy tool which would do this nicely for you: https://github.com/peterjc/pico_galaxy/tree/master/tools/seq_rename. It is written in Python, but currently it uses the Galaxy FASTA parser rather than the Biopython one.
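For comparison, the renaming can also be done without any FASTA parser by rewriting only the header lines. The sketch below is not taken from the Galaxy tool above; it is a plain-Python illustration with hypothetical function names, and it assumes the ID is the first whitespace-delimited word after the >.

```python
def rename_header(line, id_map):
    """Return a FASTA line with its leading id replaced if it is in id_map."""
    if not line.startswith(">"):
        return line  # sequence lines pass through untouched
    parts = line[1:].rstrip("\n").split(None, 1)
    if not parts:
        return line  # a bare ">" with no id; leave it alone
    old_id = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    new_id = id_map.get(old_id, old_id)  # keep the old id when unmapped
    header = f"{new_id} {rest}" if rest else new_id
    return ">" + header + "\n"


def rename_fasta_text(in_path, out_path, id_map):
    """Stream in_path to out_path line by line, renaming header ids."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(rename_header(line, id_map))
```

This streams the file, so memory use stays flat even for large inputs, but it does no validation of the FASTA format.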