I know a similar question was asked previously, but I tried Eric Normandeau's code Extracting Multiple Fasta Sequences At A Time From A File Containing Many Sequences. But It did not rsults as like as I wanted. perl code (which works beautifully if you want to extract all sequence, but doesn't extract what I need (about 100 fasta sequences by headers ID from a big file containing 3000 total FASTA files concatenated in one file). Can anyone help? Perhaps it is a simple modification? Here is the code that was posted, I just don't know how to change it so I can input specific headers ID (currently fails, it results no diference in both input and out fasta file, output contain similiar number of sequences). Many thanks to you knowledgeable folks out there! Or is there any other way I can solve my problem
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from Bio import SeqIO
fasta_file = sys.argv[1] # Input fasta file
number_file = sys.argv[2] # header ID txt file
result_file = sys.argv[3] # Output fasta file
wanted = set()
with open(number_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line[25:50])
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
end = False
with open(result_file, "w") as f:
for seq in fasta_sequences:
if seq.id[25:50]:
SeqIO.write([seq], f, "fasta")
Here is my input fasta file:
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_2_assembled_to_Act
-ACCTCACTCACACCCTCACCACACCGCAAACACCGTCGCAATGGAAGGTAAGCTGCGCCCGCTCCGATTTTGCCCATTGTTGCCTTCCCTATTCTCAATCGCCTTCATCACTGGCAATGCCCTCATATCGGACACAT--GACTGACTCGATTGCCTCCTACAGAAGAAGTTGCCGCTTTGGTCATTGACAATGGGTACGGATGCCTTCTCCCGCCCTCGATTGCGACTCCGCTCAGCTCCGACGTCATCCTCCCGCCCACCACACCCCAGTGAAAACAGCATGGATGAGAGGCGCAAGCGGAGCGCAGCGGACACACATATCCAGCGGACATCACTGATACCCTCCCAGCTCGGGAATGTGCAAGGCCGGTTTCGCCGGTGACGATGCGCCCCGCGCTGTCTTCCGTGAGTTTCA-C--CACTCTATTGACAACTCCAGCGCCAGCCAGCTGACCCTCCCATAGCATCCATTGTCGGCCGACCGCGCCACCACGGGTATGCGATACCAT---CTCCTCCTCCTC-------ACCC---GACTCGTCTAACACCACCCACAGCATCATGATCGGTATGGGCCAAAAGGACTCGTACGTCGGAGA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_20_assembled_to_Act
-ACCTCACTCACACCCTCACCACACCGCAAACACCGTCACAATGGAAGGTAAGCTGCGCCCGCTCCGATTTTGCCCACTGTTGCCTTTCCTATCCTCAATCGCCTTCATCACTGGCAATGCCTTCATATCGGACACAG--GACTGACTCAATTACCTCCCACAGAAGAAGTTGCCGCTTTGGTCATTGACAATGGGTACGGATGCCTTCTCCCGCCCTCGATTGCGACTCCGCTCAGCTCCGACGTCATCCTTCCGCCCACCACACCCCAGTGAAAACAGCATGGATGAGAGGCGCGAGCGGAGCGCAGCGGACACACATATCCAGTGGACATCACTGACATCCTCCCAGCTCGGGAATGTGCAAGGCCGGTTTCGCCGGTGACGATGCGCCCCGCGCTGTCTTCCGTGAGTTTCA-CACCTCTCTATTGACACCTCCAGCGCCAGCCAGCTGACCCTCCCATAGCATCCATTGTCGGCCGACCGCGCCACCACGGGTATGCGATACCAT---CTCCTCCTCCTC-------ACCC---GACTCGTCTAACACCACCCACAGCATCATGATCGGTATGGGCCAAAAGGACTCGTACGTCGGAGA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_32_assembled_to_Act
-ACCTCACTCACACCCTCACCACACCGCAAACACCGTCGCAATGGAAGGTAAGCTGCGCCCGCTCCGATTTTGCCCACTGTTGCCTTTCCTATCCTCAATCGCCTTCATCACTGGCAATGCCTTCATATCGGACACAGGAGACTGACTCAATTACCTCCCACAGAAGAAGTTGCCGCTTTGGTCATTGACAATGGGTACGGATGCCTTCTCCCGCCCTCGATTGCGACTCCGCTCAGCTCCGACGTCATCCTTCCGCCCACCACACCCCAGTGAAAACAGCATGGATGAGAGGCGCGAGCGGAGCGCAGCGGACACACATATCCAGTGGACATCACTGACATCCTCCCAGCTCGGGAATGTGCAAGGCCGGTTTCGCCGGTGACGATGCGCCCCGCGCTGTCTTCCGTGAGTTTCA-CACCTCTCTATTGACACCTCCAGCGCCAGCCAGCTGACCCTCCCATAGCATCCATTGTCGGCCGACCGCGCCACCACGGGTATGCGACGCCAT---CTCCTCCTCCTCCTC-CTCCCTC---GACTCGTCTAACATCACCCACAGCATCATGATCGGTATGGGCCAAAAGGACTCGTACGTCGGAGA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
And here is my headers IDs text file (Headers are in separate line):
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_20_assembled_to_Act
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_32_assembled_to_Act
Result file should look like:
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_20_assembled_to_Act
-ACCTCACTCACACCCTCACCACACCGCAAACACCGTCACAATGGAAGGTAAGCTGCGCCCGCTCCGATTTTGCCCACTGTTGCCTTTCCTATCCTCAATCGCCTTCATCACTGGCAATGCCTTCATATCGGACACAG--GACTGACTCAATTACCTCCCACAGAAGAAGTTGCCGCTTTGGTCATTGACAATGGGTACGGATGCCTTCTCCCGCCCTCGATTGCGACTCCGCTCAGCTCCGACGTCATCCTTCCGCCCACCACACCCCAGTGAAAACAGCATGGATGAGAGGCGCGAGCGGAGCGCAGCGGACACACATATCCAGTGGACATCACTGACATCCTCCCAGCTCGGGAATGTGCAAGGCCGGTTTCGCCGGTGACGATGCGCCCCGCGCTGTCTTCCGTGAGTTTCA-CACCTCTCTATTGACACCTCCAGCGCCAGCCAGCTGACCCTCCCATAGCATCCATTGTCGGCCGACCGCGCCACCACGGGTATGCGATACCAT---CTCCTCCTCCTC-------ACCC---GACTCGTCTAACACCACCCACAGCATCATGATCGGTATGGGCCAAAAGGACTCGTACGTCGGAGA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>Zymo_pool-7_2021-03-23_v1_Hiplex_tag_32_assembled_to_Act
-ACCTCACTCACACCCTCACCACACCGCAAACACCGTCGCAATGGAAGGTAAGCTGCGCCCGCTCCGATTTTGCCCACTGTTGCCTTTCCTATCCTCAATCGCCTTCATCACTGGCAATGCCTTCATATCGGACACAGGAGACTGACTCAATTACCTCCCACAGAAGAAGTTGCCGCTTTGGTCATTGACAATGGGTACGGATGCCTTCTCCCGCCCTCGATTGCGACTCCGCTCAGCTCCGACGTCATCCTTCCGCCCACCACACCCCAGTGAAAACAGCATGGATGAGAGGCGCGAGCGGAGCGCAGCGGACACACATATCCAGTGGACATCACTGACATCCTCCCAGCTCGGGAATGTGCAAGGCCGGTTTCGCCGGTGACGATGCGCCCCGCGCTGTCTTCCGTGAGTTTCA-CACCTCTCTATTGACACCTCCAGCGCCAGCCAGCTGACCCTCCCATAGCATCCATTGTCGGCCGACCGCGCCACCACGGGTATGCGACGCCAT---CTCCTCCTCCTCCTC-CTCCCTC---GACTCGTCTAACATCACCCACAGCATCATGATCGGTATGGGCCAAAAGGACTCGTACGTCGGAGA----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
same results happen when I run this script too
#!/usr/bin/env python
# a script to extract fasta records from a fasta file to multiple separate fasta files based on a particular ID (time point) in a part$# to run, navigate to file location with command prompt and enter: python split_fasta_by_collections.py infile.fasta
from Bio import SeqIO
import os
import sys
records = SeqIO.parse(sys.argv[1], "fasta")
collected = {}
for record in records:
descr = record.description.split("_")[6] # "_" sets the delimeter, "1" sets the field where counting starts at 0 for the first fie$ try:
collected[descr].append(record)
except KeyError:
collected[descr] = [record ,]
file_name = "outfile_tag_%s.fa"
file_path = os.getcwd() #sets the output file path to your current working directory
for (descr, records) in collected.items():
with open(os.path.join(file_path, file_name % descr), 'w') as f:
SeqIO.write(records, f, 'fasta')
If it's a really big file, I'd recommend looking in to using
SeqIO.index()
rather than.read()
or.parse()
, unless you're happy to just read through the file a single time.