Question

Extract A Group Of Fasta Sequences From A File

6

Entering edit mode

15.0 years ago

Kyle ▴ 60

I have a group of fasta sequences numbered sequentially 1,2, etc. and would like to extract a subset by their numbers/identifiers. Using bp_index and bp_fetch in bioperl enables me to extract a single sequences, but I am not familiar enough with perl to automate this for a large group. Any suggestions, ideally in python but also in perl? Thanks.

fasta python perl biopython bioperl • 46k views

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 15.0 years ago by Kyle ▴ 60

Ram · Answer 1 · 2010-07-15

18

Entering edit mode

15.0 years ago

Eric Normandeau 11k

Hi,

Here is a quick solution in Python.

from Bio import SeqIO

fasta_file = "fasta_file.fasta" # Input fasta file
wanted_file = "wanted_file.txt" # Input interesting sequence IDs, one per line
result_file = "result_file.fasta" # Output fasta file

wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
with open(result_file, "w") as f:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], f, "fasta")

Cheers!

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.0 years ago by Eric Normandeau 11k

3

Entering edit mode

i think that's a poor representation of how easy it is to do this in biopython (sorry eric, but ...). how about something like this:

http://gist.github.com/477969

looks less like java, uses less code, and is more flexible.

ADD REPLY • link 15.0 years ago by brentp 24k

1

Entering edit mode

@eric, what does (number_file) stands for, in your code? Just because I don't understand it. Is it the wanted_file? Thank u!

ADD REPLY • link 14.0 years ago by Peixe ▴ 660

1

Entering edit mode

Hi @peixe, you had guessed right. I changed it to 'wanted_file'. Cheers

ADD REPLY • link 14.0 years ago by Eric Normandeau 11k

1

Entering edit mode

Just found this response (after posting on 2827),but the results file does not actually get anything written to it. One main issue I can think of is I am just trying to compare "seq.id" and not the full "seq.description". Does that make a difference - if my seq.id is less than the full fasta >seq.description? Sorry for the confusion, thanks. -e

ADD REPLY • link 13.5 years ago by User 7521 ▴ 10

1

Entering edit mode

Yes, it matters, .id is the first word only, .description is the full title line from the > line in the FASTA file.

ADD REPLY • link 11.6 years ago by Peter 6.0k

0

Entering edit mode

Wow ! great thanks!

ADD REPLY • link 15.0 years ago by Kyle ▴ 60

0

Entering edit mode

@brentp I think the 'try/except' may be a little scary, but I find them necessary to insure that one gets a sensible warning if ever something goes wrong. Something more precise than 'go look at line 43, index is too big'. :) Your example is very beautiful, and much more pythonesque than my own, however, I wanted to avoid reading the whole file at once. That way, a very large fasta file can be read on a typical student laptop. Since I am toying with NGS data these days, I tend to avoid loading whole files :) Thanks for the very neat example!

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

0

Entering edit mode

@eric, my example never holds the entire fasta file in memory. your code will give a message if the header doesn't exist in the file, but that's just one more line!

ADD REPLY • link 15.0 years ago by brentp 24k

0

Entering edit mode

@brentp Sorry, I thought you were using a list comprehension in the SeqIO.write function. Can I then use your suggestions to update my code and make it better? Would you rather post an answer yourself? In any case, thanks for the insight. I've been thinking all night long that there may be something I was missing in my current approach :)

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

0

Entering edit mode

@eric, of course you can use the suggestions to update. your answer does the right thing, i just wanted to point out a terser approach.

ADD REPLY • link 15.0 years ago by brentp 24k

0

Entering edit mode

Perfect! Thanks @eric!

ADD REPLY • link 14.0 years ago by Peixe ▴ 660

0

Entering edit mode

Current version has a bug, rather than "if name in wanted" should be "if seq.id in wanted" or "if seq.name in wanted" - you haven't even got a variable called name.

But it should be faster to use a generator expression as in brentp's comment.

ADD REPLY • link 14.0 years ago by Peter 6.0k

0

Entering edit mode

Hi @Peter. You are right on both points. There was a bug (corrected now) and brentp's version is better, I think for quite a few reasons. Cheers

ADD REPLY • link 14.0 years ago by Eric Normandeau 11k

0

Entering edit mode

@Eric Thank you!!!

ADD REPLY • link updated 3.6 years ago by Ram 45k • written 11.0 years ago by ljohanam ▴ 30

0

Entering edit mode

Whenever I try to run the script it says

 File "peptideextract.py", line 12, in <module>
    fasta_sequences = SeqIO.parse(open(fasta_file),'fasta')
NameError: name 'SeqIO' is not defined

Where is SeqIO supposed to be defined? Is it supposed to be a column within our fasta and/or text file?

Thanks!

ADD REPLY • link updated 6.8 years ago by Ram 45k • written 10.9 years ago by skilleta0527 • 0

0

Entering edit mode

Have you imported the module? Like this:

from Bio import SeqIO

p.

ADD REPLY • link updated 6.8 years ago by Ram 45k • written 10.8 years ago by Peixe ▴ 660

0

Entering edit mode

<<<<<<<<<<<<<<<<<

ADD REPLY • link 8.2 years ago by jackiegoordial • 0

0

Entering edit mode

Use a for loop over the files found with os.listdir(...) or the glob module?

ADD REPLY • link 10.6 years ago by Peter 6.0k

0

Entering edit mode

Hello I have troubles running this script. I have an ID's file with exactly the same header as the fasta file, the results file is generated but is always empty. What am I doing wrong?

Thanks

ADD REPLY • link 7.3 years ago by catarino.87 • 0

0

Entering edit mode

Thanks very much for this! I have made a few modifications to Eric's script so it can be easily run on different files. I have it saved as 'seq_extractor.py' on my computer:

# Execute by: python seq_extractor.py readsList fastafile outputfile

from Bio import SeqIO
import sys

readsList = open(sys.argv[1], 'rU')
fastafile = sys.argv[2]
outputfile = open(sys.argv[3], 'w')

wanted = set()
with readsList as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

fasta_sequences = SeqIO.parse(open(fastafile),'fasta')

with outputfile as i:
    for seq in fasta_sequences:
        if seq.id in wanted:
            SeqIO.write([seq], i, "fasta")

readsList.close()
outputfile.close()

Hope that helps.

ADD REPLY • link 7.1 years ago by kwathen-dunn ▴ 10

Ram · Answer 2 · 2010-07-15

12

Entering edit mode

15.0 years ago

brentp 24k

FWIW, you can do this from the commandline if you have pyfasta installed with:

pyfasta extract --header --fasta my.fa header1 header2 header3 subset.fa

or if the headers you want are listed in a file:

pyfasta extract --header --fasta my.fa --file header-list.txt > subset.fa

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.0 years ago by brentp 24k

0

Entering edit mode

I tried this I get Error

Traceback (most recent call last):
  File "/usr/local/bin/pyfasta", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 38, in main
    globals()[action](sys.argv[2:])
  File "/usr/local/lib/python3.8/site-packages/pyfasta/__init__.py", line 134, in extract
    seq = f[seqname]
  File "/usr/local/lib/python3.8/site-packages/pyfasta/fasta.py", line 128, in __getitem__
    c = self.index[i]
KeyError: '>MITEs101'

ADD REPLY • link 4.8 years ago by bioinfo • 0

Ram · Answer 3 · 2010-07-15

2

Entering edit mode

15.0 years ago

User 59 13k

This question I believe has already been quite well answered here.

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 15.0 years ago by User 59 13k

1

Entering edit mode

The OP is asking for a Python solution preferably, which the other questions aswers do not provide :)

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

0

Entering edit mode

The OP is asking for a python solution here, not a Perl one :)

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

0

Entering edit mode

no but he did say 'and also in Perl' which the linked answer did. I leave it to y u to pick up the reputation for the accepted solution :D

ADD REPLY • link 15.0 years ago by User 59 13k

0

Entering edit mode

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my better knowledge, corresponded to the OP's need. Cheers

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

0

Entering edit mode

You are very kind to leave it to me. However, I had no choice in making the decision of accepting this answer. I merely suggested one approach which, to my best knowledge, corresponded to the OP's described need. Cheers

ADD REPLY • link 15.0 years ago by Eric Normandeau 11k

Ram · Answer 4 · 2012-03-15

Here is a solution with GNU pcregrep, this is what I generally use to extract an entry (title+sequence) from a fasta file (in this example gi|116108791 and gi|30261391 in NCBI nr.fasta, you can add more patterns with -e) :

pcregrep --buffer-size=16M -M -e 'gi[[:punct:]]116108791[[:punct:]].*\s([[:alpha:]]+\s*)*' -e 'gi[[:punct:]]30261391[[:punct:]].*\s([[:alpha:]]+\s*)*' nr.fasta

--buffer-size to increase the default buffer-size which was not sufficient for NCBInr

-M for multiline matching

-e for adding a pattern

Depending of your platform, use single/double quotes. This has been tested with the Windows GNU cygwin version 8.21 2011-12-12 of PCREGREP but should work with the pcregrep binary of every Linux distribution. The main limitation is that it has to parse the entire file to return control. I tried to add the '--match-limit' parameter to the number of patterns but gave me some weird errors.

Ram · Answer 5 · 2012-03-15

1

Entering edit mode

13.3 years ago

Israel Barrantes ▴ 790

If you have a list of sequence IDs that you want to extract, you can use the UCSC utilities faOneRecord (to extract just a single record), or faSomeRecords (multiple sequences):

$ faSomeRecords inputfile.fasta listfile outputfile.fasta

Also, using the option '-exclude' will output sequences not present in 'listfile'

ADD COMMENT • link 13.3 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

Hi Israel Barrantes,

I used below cmd to extract RNA seq from fasta file using ID number present in text file.

faSomeRecords nr_latest.fasta sequence.gi.txt outputfile.fasta

After above run was finished, outputfile that is outputfile.fasta has 0 byte size.

Can you please give me suggestion, what am I doing wrong.

Thanks in advance!!

I would really appreciate your help,

Thanks,

Naresh

ADD REPLY • link updated 6.8 years ago by Ram 45k • written 11.6 years ago by nbvasani ▴ 240

Ram · Answer 6 · 2016-04-28

0

Entering edit mode

9.2 years ago

cpcantalapiedra ▴ 170

If besides an identifier you had sequence length, you could create a simple BED file and use "bedtools getfasta".

An example would be:

Create a bed file from your whole fasta file.
Grep the bed lines of your identifers and pipe it to bedtools getfasta -fi whole_fasta -bed - -fo myfasta

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 9.2 years ago by cpcantalapiedra ▴ 170