Here's a python script that does what (I think) you want. Let me know if it works or not.
EDIT: Bad formatting.
# Dictionary with strings to replace and what to replace them with
replace_strings = {}
with open("replace_ids.tsv", "r") as id_file:
# Read file line-by-line
for line in id_file.readlines():
# Split line on TAB
ids = line.strip().split("\t")
# Fist entry is the original ID
original_id = ids[0]
# Second entry is your ID
my_id = ids[1]
# Add both to our dictionary of strings to replace
replace_strings[original_id] = my_id
# Read file with sequences, called "sequence.txt"
with open("sequence.txt", "r+") as infile:
# Read each line of file into a list
content = infile.readlines()
# Keep a list of the lines with the replaced strings
new_content = []
# Loop lines in the file content
for line in content:
new_line = line
# Find and replace any original_id with your own ids in the line of content and add it to our list of replaced lines
for original_id, my_id in replace_strings.items():
new_line = new_line.replace(original_id, my_id)
new_content.append(new_line)
# Write replaced content to a new file called "outfile.txt"
with open("outfile.txt", "w") as outfile:
for line in new_content:
outfile.write(line)
EDIT2:
replace_ids.tsv is your tab-separated file of IDs. I've assumed that the original ID is at the first position and that your own ID is at the second position, e.g.: orignal_id TAB your_id
sequence.txt is the name of the file containing the sequence(s).
Looks like the code gets the job done, and nicely commented too. But I can't resist to offer you some suggestions:
Avoid .readlines(). It loads the entire file in memory which may be problematic for huge files. Just loop over the file:
Example:
for line in id_file:
ids = line.strip().split('\t')
replace_strings[ids[0]] = ids[1]
Similar, try to process the sequence file while looping without keeping it in memory. Read a record, write a record with corrected identifier
Use the sys module to get arguments from the command line to the script, executing the script as python myscript.py file1 file2,
in the script you will then have sys.argv[1] be file1 and sys.argv[2] file2.
Probably for this particular case those suggestions are pedantic and unnecessary, but for larger files those issues become very real.
I also found this script very useful, but I have a question.
When we get to the outfile, how does it know to write the new lines while keeping the rest of the lines from the original file? i.e. How does it know to write new_content (which has our new ids) AND write the sequences following those ids?
Even in the second part using infile, I don't understand how it does not lose all the other lines following the original_ids since we are keeping a list of only replaced lines (which are the ids)?
If someone could explain this concept, I would appreciate it!
Hi, what we're doing here is essentially reading the input file, replacing some IDs (if there's an ID to replace, otherwise we just copy the line), then write it to a new file:
The content variable contains all the data in the infile, i.e. both IDs and the sequence strings. Thus when we loop content, we'll first read a line with an ID, then a line with a sequence, then a line with an ID, then a line with a sequence, etc. We store this line in the new_line variable (new_line = line). We then try to replace the original IDs with our own IDs in new_line using a for-loop and the replace() function. Here, if new_line contains and ID, we'll replace the old ID with our own ID. However, if new_line happens to be a line with a sequence (not an ID), nothing will change, the line will remain as it was read, containing only the sequence. After this, we add new_line to our new_content variable, which is written to file.
To answer your question, we're not only looking at lines containing IDs, we're looking at those including sequences as well, but we're not altering those. Early morning write-up, hope this makes sense.
This may also also be useful
sed -e '/Aquca_005_00546.1/RaAc00546A/' filename.fa > filenameedit.fa