Hi all friends,
I have a large fasta file that most sequences have a identical header (they differ from the length). I usually extracted the sequences of interest with the following python codes:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Extract sequences from a fasta file if their name is in a 'wanted' file.
Wanted file contains one sequence name per line.
Usage:
%program <input_file> <wanted_file> <output_file>"""
import sys
import re
try:
from Bio import SeqIO
except:
print "This program requires the Biopython library"
sys.exit(0)
try:
fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input wanted file, one gene name per line
result_file = sys.argv[3] # Output fasta file
except:
print __doc__
sys.exit(0)
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
for seq in fasta_sequences:
name = seq.id
if name in wanted and len(seq.seq.tostring()) > 0:
wanted.remove(name) # Output only the first appearance for a name
SeqIO.write([seq], f, "fasta")
But the code just extracted one of the sequences with identical header, while I want to all sequences of interest that may have identical headers and adding the sequential number at the beginning or end of identical headers in the extracted fasta file? something like:
header_1
header_2
header_3
etc.
Many thanks
Thanks, I name it py_ext.py, but it retutned the below error:
I tried to convert tab to space, but the error still remain. Would you please check the code?
I edited my answer. The code you pasted in your question has the problem with indentations (no idents after try: or with:) so it doesn't work at all. I would test my code if you provide some sample lines of your input files.
Thank you very much. It worked well and I accepted it.
and after changing it
I updated the code to Python3.