Hi all, I need a trained python eye for this :)
I need to remove hundreds of genes from a proteome file that contains thousands of genes. Obviously I do not want to do it manually. I pulled the Python code pasted below from somewhere; it is a few years old. It is supposed to do what I want, but it does not: it just copies all the sequences from the original file to the output file, ignoring the remove file. The code requires 3 files, which I supplied. File 1, "123.fasta": my original unedited proteome. File 2, "remove.txt": the list of gene IDs to be removed. File 3, "new.fasta": the output file with the edited proteome minus the genes listed in remove.txt. Ideally, I would like the code to identify the genes in "123.fasta" by the FASTA sequence ID (e.g. >sequence1, >sequence2, etc.).
This is the code:
    import Bio
    from Bio import SeqIO
    import sys
    fasta_file = ("123.fasta")
    remove_file = ("remove.txt")
    result_file = ("new.fasta")
    remove = set (">")
    with open(remove_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                remove.add(line)
    fasta_sequences = SeqIO.parse(open(fasta_file), "fasta")
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            nam = str()
            nam = nam.strip(seq.id)
            nuc = str(seq.seq)
            SeqIO.write([seq], f, "fasta")
As I said, no matter what I tweak, it just copies all of the 123.fasta file into the output file, with no deletions. Can any of the Python people see what the problem may be? I am not a trained Python user, I just use it for my work.
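For reference, the loop in the posted script calls SeqIO.write for every record and never checks the ID against the remove set, which would explain why everything gets copied. Below is a minimal sketch of the missing filter step, written with the standard library only so it runs without Biopython. The function names (load_ids, filter_fasta, remove_records) are hypothetical, and it assumes that the sequence ID is the first whitespace-delimited word after ">" and that remove.txt lists one ID per line, with or without a leading ">". In the Biopython version, the equivalent change would be to write a record only if seq.id is not in remove.

```python
def load_ids(lines):
    """Collect the IDs to remove, one per line; blank lines are skipped."""
    ids = set()
    for line in lines:
        # Tolerate a leading '>' and any description after the ID (assumption).
        parts = line.strip().lstrip(">").split()
        if parts:
            ids.add(parts[0])
    return ids


def filter_fasta(fasta_lines, remove_ids):
    """Yield the lines of every FASTA record whose ID is not in remove_ids."""
    keep = False  # lines before the first header are dropped
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            # Decide once per record; the sequence lines below inherit it.
            keep = seq_id not in remove_ids
        if keep:
            yield line


def remove_records(fasta_path, remove_path, result_path):
    """Copy fasta_path to result_path, leaving out the listed records."""
    with open(remove_path) as f:
        remove_ids = load_ids(f)
    with open(fasta_path) as src, open(result_path, "w") as dst:
        dst.writelines(filter_fasta(src, remove_ids))
```

With the file names from the question, the call would be remove_records("123.fasta", "remove.txt", "new.fasta"). The IDs in remove.txt have to match the FASTA IDs exactly, so it is worth checking a few of them by hand first.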
Since this question is not about Python code you wrote, consider the faSomeRecords utility from Jim Kent (LINK). After downloading the file, add execute permissions (chmod u+x faSomeRecords). Use as follows: