I have a fasta file like this:
>Contig1
aattata
>Contig2
ttgtgtgt
>Contig3
cgcgtgtgt
>Contig4
tgtgttgtttttt
>Contig5
cccgcgctg
>Contig6
tgtgtgtgtgtgt
>Contig7
aacgctg
>Contig8
acgtgtgc
I have a 2nd txt file:
1
5
7
I want extract a subset from the fasta file if the header contains EXACTLY the number in the txt file, like this:
>Contig1
aattata
>Contig5
cccgcgctg
>Contig7
aacgctg
I don't want >Contig10 even it contains the number 1.
How to do this in python? I tried scripts other people provided but they both give me empty output file. I don't know what I did wrong: script1:
"""
%prog some.fasta wanted-list.txt
"""
from Bio import SeqIO
import sys
wanted = [line.strip() for line in open(sys.argv[2])]
print(wanted)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
SeqIO.write((seq for seq in seqiter if seq.id in wanted), sys.stdout, "fasta")
sys.stdout.close()
Script2:
from Bio import SeqIO
import sys
fasta_file = sys.argv[1] # Input fasta file
wanted_file = sys.argv[2] # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
print(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
for seq in fasta_sequences:
if seq.id in wanted:
SeqIO.write([seq], f, "fasta")
Why are you reinventing the wheel? There already exist a bunch of tools that do this!
You can try this: