I have a group of fasta sequences numbered sequentially 1,2, etc. and would like to extract a subset by their numbers/identifiers. Using bp_index and bp_fetch in bioperl enables me to extract a single sequences, but I am not familiar enough with perl to automate this for a large group. Any suggestions, ideally in python but also in perl?
Thanks.
from Bio import SeqIO
fasta_file = "fasta_file.fasta" # Input fasta file
wanted_file = "wanted_file.txt" # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file
wanted = set()
with open(wanted_file) as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
for seq in fasta_sequences:
if seq.id in wanted:
SeqIO.write([seq], f, "fasta")
Just found this response (after posting on 2827),but the results file does not actually get anything written to it. One main issue I can think of is I am just trying to compare "seq.id" and not the full "seq.description". Does that make a difference - if my seq.id is less than the full fasta >seq.description?
Sorry for the confusion, thanks.
-e
@brentp I think the 'try/except' may be a little scary, but I find them necessary to insure that one gets a sensible warning if ever something goes wrong. Something more precise than 'go look at line 43, index is too big'. :) Your example is very beautiful, and much more pythonesque than my own, however, I wanted to avoid reading the whole file at once. That way, a very large fasta file can be read on a typical student laptop. Since I am toying with NGS data these days, I tend to avoid loading whole files :) Thanks for the very neat example!
@eric, my example never holds the entire fasta file in memory.
your code will give a message if the header doesn't exist in the file, but that's just one more line!
@brentp Sorry, I thought you were using a list comprehension in the SeqIO.write function. Can I then use your suggestions to update my code and make it better? Would you rather post an answer yourself? In any case, thanks for the insight. I've been thinking all night long that there may be something I was missing in my current approach :)
Current version has a bug, rather than "if name in wanted" should be "if seq.id in wanted" or "if seq.name in wanted" - you haven't even got a variable called name.
But it should be faster to use a generator expression as in brentp's comment.
Hello
I have troubles running this script.
I have an ID's file with exactly the same header as the fasta file, the results file is generated but is always empty.
What am I doing wrong?
Thanks very much for this! I have made a few modifications to Eric's script so it can be easily run on different files. I have it saved as 'seq_extractor.py' on my computer:
# Execute by: python seq_extractor.py readsList fastafile outputfile
from Bio import SeqIO
import sys
readsList = open(sys.argv[1], 'rU')
fastafile = sys.argv[2]
outputfile = open(sys.argv[3], 'w')
wanted = set()
with readsList as f:
for line in f:
line = line.strip()
if line != "":
wanted.add(line)
fasta_sequences = SeqIO.parse(open(fastafile),'fasta')
with outputfile as i:
for seq in fasta_sequences:
if seq.id in wanted:
SeqIO.write([seq], i, "fasta")
readsList.close()
outputfile.close()
Traceback (most recent call last):
File "/usr/local/bin/pyfasta", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 38, in main
globals()[action](sys.argv[2:])
File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 134, in extract
seq = f[seqname]
File "/usr/local/lib/python3.8/site-packages/pyfasta/fasta.py", line 128, in __getitem__
c = self.index[i]
KeyError: '>MITEs101'
You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my better knowledge, corresponded to the OP's need. Cheers
You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my best knowledge, corresponded to the OP's described need. Cheers
Here is a solution with GNU pcregrep, this is what I generally use to extract an entry (title+sequence) from a fasta file (in this example gi|116108791 and gi|30261391 in NCBI nr.fasta, you can add more patterns with -e) :
--buffer-size to increase the default buffer-size which was not sufficient for NCBInr
-M for multiline matching
-e for adding a pattern
Depending of your platform, use single/double quotes. This has been tested with the Windows GNU cygwin version 8.21 2011-12-12 of PCREGREP but should work with the pcregrep binary of every Linux distribution.
The main limitation is that it has to parse the entire file to return control. I tried to add the '--match-limit' parameter to the number of patterns but gave me some weird errors.
If you have a list of sequence IDs that you want to extract, you can use the UCSC utilities faOneRecord (to extract just a single record), or faSomeRecords (multiple sequences):
i think that's a poor representation of how easy it is to do this in biopython (sorry eric, but ...). how about something like this:
http://gist.github.com/477969
looks less like java, uses less code, and is more flexible.
@eric, what does (number_file) stands for, in your code? Just because I don't understand it. Is it the wanted_file? Thank u!
Hi @peixe, you had guessed right. I changed it to 'wanted_file'. Cheers
Just found this response (after posting on 2827),but the results file does not actually get anything written to it. One main issue I can think of is I am just trying to compare "seq.id" and not the full "seq.description". Does that make a difference - if my seq.id is less than the full fasta >seq.description? Sorry for the confusion, thanks. -e
Yes, it matters,
.id
is the first word only,.description
is the full title line from the>
line in the FASTA file.Wow ! great thanks!
@brentp I think the 'try/except' may be a little scary, but I find them necessary to insure that one gets a sensible warning if ever something goes wrong. Something more precise than 'go look at line 43, index is too big'. :) Your example is very beautiful, and much more pythonesque than my own, however, I wanted to avoid reading the whole file at once. That way, a very large fasta file can be read on a typical student laptop. Since I am toying with NGS data these days, I tend to avoid loading whole files :) Thanks for the very neat example!
@eric, my example never holds the entire fasta file in memory. your code will give a message if the header doesn't exist in the file, but that's just one more line!
@brentp Sorry, I thought you were using a list comprehension in the SeqIO.write function. Can I then use your suggestions to update my code and make it better? Would you rather post an answer yourself? In any case, thanks for the insight. I've been thinking all night long that there may be something I was missing in my current approach :)
@eric, of course you can use the suggestions to update. your answer does the right thing, i just wanted to point out a terser approach.
Perfect! Thanks @eric!
Current version has a bug, rather than "if name in wanted" should be "if seq.id in wanted" or "if seq.name in wanted" - you haven't even got a variable called name.
But it should be faster to use a generator expression as in brentp's comment.
Hi @Peter. You are right on both points. There was a bug (corrected now) and brentp's version is better, I think for quite a few reasons. Cheers
@Eric Thank you!!!
Whenever I try to run the script it says
Where is SeqIO supposed to be defined? Is it supposed to be a column within our fasta and/or text file?
Thanks!
Have you imported the module? Like this:
p.
Use a for loop over the files found with
os.listdir(...)
or theglob
module?Hello I have troubles running this script. I have an ID's file with exactly the same header as the fasta file, the results file is generated but is always empty. What am I doing wrong?
Thanks
Thanks very much for this! I have made a few modifications to Eric's script so it can be easily run on different files. I have it saved as 'seq_extractor.py' on my computer:
Hope that helps.