Extract fasta sequences from a large file using a list of names
6
3
Entering edit mode
9.6 years ago
fhsantanna ▴ 620

I have a large fasta file of 16S sequences and I want to retrieve sequences using a list of organism names. Do you know a script capable of doing it?

EDIT:

Headers look like that:

>S000000859 Bacillus sp. USC14; AF346495
sequence
>S000001027 Paenibacillus borealis; KN25; AJ011325
sequence

And I have a list like the following:

Paenibacilus borealis
Paenibacillus sp. 1-18
Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11

I want to retrieve those sequences that match with names present in the list.

fasta • 19k views
ADD COMMENT
0
Entering edit mode

Can you show us some example headers from your FASTA file? Right now I'm thinking to put all the organism names in a text file and simply use grep -f

ADD REPLY
0
Entering edit mode

I believe I was not clear. My 16S fasta file has sequences of hundreds of species. I have to retrieve just dozens of them using a list of organisms of interest.

ADD REPLY
1
Entering edit mode

If the fasta sequence is in one line it's":

grep -A1 'organism name'

If it's not it would be easier to format it into one line and then do grep.

ADD REPLY
0
Entering edit mode

A good way to start. Very simple. Thanks. The problem is when the sequences have different sizes.

ADD REPLY
5
Entering edit mode
9.6 years ago
tlawrence3 ▴ 70

You can do this easily by installing the FAST: Fast Analysis of Sequences Toolbox (publication)(github).

You can install it by using this command (only use sudo if you need to):

(sudo) perl -MCPAN -e 'install FAST'

Once it is installed here is a small bash script to do what you need:

#!/bin/bash

while read line ;do
    cat original_fasta.fa | fasgrep -di "$line"  >> reduced_fasta_file.fa
done < species.txt
ADD COMMENT
0
Entering edit mode

Asking just for clarification. In above bash script which one is the file containing the fasta sequence and which one is the file containing list of organism names?

ADD REPLY
0
Entering edit mode

Updated the example to make it more clear.

ADD REPLY
0
Entering edit mode

Very good. I was able to extract the sequences running the program one by one.

But when I try the bash script, the following error message appears: "fasgrep: expects one regex argument when input is on STDIN."

ADD REPLY
0
Entering edit mode

Try ./faSomeRecords very simple one as shown

ADD REPLY
0
Entering edit mode

tlawrence, do you know why I had trouble with your bash script (error message appears: fasgrep: expects one regex argument when input is on STDIN.)?

It seems that fasgrep is not recognizing the $line variable.

ADD REPLY
0
Entering edit mode

It is something in the bash script. I don't know bash scripting well enough to figure it out.

This works on the command line:

fasgrep -di "Paenibacilus borealis" fasta_file.fa

However, it doesn't work in the script. Maybe someone here can help.

ADD REPLY
0
Entering edit mode

I updated the script. It should work now.

ADD REPLY
0
Entering edit mode

It did not work for me.

ADD REPLY
0
Entering edit mode

What didn't work? Did you get an empty fasta file? The wrong sequences? An error? Does doing the command manually give the expected results?

ADD REPLY
0
Entering edit mode

As stated before, fasgrep works fine manually (I accept your answer, as you can see).

However, the bash script did not work (I had to write one without using a loop, but writing a line for every organism of interest). Your bash script did not filtered the organisms of interest, all of the original sequences were included in the output and more...

ADD REPLY
0
Entering edit mode

I guess my bash scripting skills need some improvement. Since I am one of the authors of the toolkit I want to make sure it is working for as many people as possible.

ADD REPLY
0
Entering edit mode

Never mind. Great toolkit. Thanks again.

Just one thing, maybe I found a bug.

Considering a fasta file like the following (note the absence of a underscore between "S000001027" and "Paenibacillus_borealis"):

>S000001027 Paenibacillus_borealis;_KN25;_AJ011325
AAAAAAAAAATTTTTTT

When I type:

fasgrep -di "Paenibacillus_borealis" input.fas

Fasgrep is able to find the sequence of interest.

However, if the fasta sequence has a underscore between S000001027 and Paenibacillus_borealis (>S000001027_Paenibacillus_borealis;_KN25;_AJ011325), fasgrep cannot find the sequence.

ADD REPLY
1
Entering edit mode

It is not a bug. fastA format is loosely defined as:

>identifier<space>description<newline>
<sequence data>

So when you don't have a space in the header line all you have is an identifier which fasgrep searches by default. So instead of:

fasgrep -di

which searches the description field you would use:

fasgrep -i
ADD REPLY
0
Entering edit mode

+1 for properly handling the fasta defline info field.

ADD REPLY
5
Entering edit mode
9.6 years ago
h.mon 35k

The following script should work:

#!/usr/bin/env perl
my $list_file = $ARGV[0];
my $fasta_in = $ARGV[1];
my $fasta_out = $ARGV[2];
open(LIST_FILE, "<", $list_file) or die "could not open '$list_file' : $! \n";
open(FASTA_IN, "<", $fasta_in) or die "could not open '$fasta_in' : $! \n";
open(FASTA_OUT, ">", $fasta_out) or die "could not open $fasta_out : $! \n";
my @headers = ();
while(<LIST_FILE>) {
    chomp;
    next if ( /^\s*$/ );
    push(@headers, $_.";");
}
my $pat = join '|', map quotemeta, @headers;
$/ = ">";
while(<FASTA_IN>) {
    chomp;
    if ( /$pat/ ) { print FASTA_OUT ">$_"; }
}
close(LIST_FILE);
close(FASTA_IN);
close(FASTA_OUT);

Call it with:

./getFastas.pl list.txt sequences.fasta sequences.out
ADD COMMENT
0
Entering edit mode

It did not work. Blank output.

ADD REPLY
0
Entering edit mode

I tested with the following mock files:

list.txt

Paenibacillus borealis
Paenibacillus sp. 1-18
Paenibacillus sp. 1-49
Paenibacillus sp. A9
Paenibacillus sp. Aloe-11

sequences.fasta:

>S000000859 Bacillus sp. USC14; AF346495
ATGCATGCATGCATGCATGCATGC
>S000001027 Paenibacillus borealis; KN25; AJ011325
GCGCGCATATATGCGCGCATATAT
>S000001026 Paenibacillus borealisensis; KN26; AJ011326
GCGCGCATATATGCGCGCATATAT
>Simaginary Streptomyces borealis; KN25; AJ011325
AAAAAAAAAAAAAAAAAAAAAAAA
>S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas
GGGGGGGGGGGGGGGGGGGGGGGG
>S000001027_Paenibacillus borealis; KN25; AJ011325
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

worked all right for me, retrieving three sequences ( "S000001027 Paenibacillus borealis; KN25; AJ011325", "S000001027_Paenibacillus borealis; KN25; AJ011325" and "S000000000 Paenibacillus sp. 1-49; nvhjdasnviuas" ).

ADD REPLY
0
Entering edit mode

For the mock sequences.fasta, it works fine. However, it does not work if the sequences.fasta has the following sequence:

>AB042938.1.1462 Paenibacillus borealis
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGGACUUUGCCUUCGGGUAAAGUUAGCGGCGGACGGGUGAG
UAACACGUGGGUAACCUGCCCAUAAGACUGGGAUAACAUUCGGAAACGAAUGCUAAUACCGGAUACGCGAAUCGGUCGCA
UGAUCGAAUCGGGAAAGGCGGAGCAAUCUGCCACUUAUGGAUGGACCCGCGGCGCAUUAGCUAGUUGGUGGGGUAACGGC
UCACCAAGGCGACGAUGCGUAGCCGACCUGAGAGGGUGAUCGGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGG
GAGGCAGCAGUAGGGAAUCUUCCGCAAUGGACGAAAGUCUGACGGAGCAACGCCGCGUGAGUGAUGAAGGUUUUCGGAUC
GUAAAGCUCUGUUGCCAGGGAAGAACGCUUGCGAGAGUAACUGCUCGCAAGGUGACGGUACCUGAGAAGAAAGCCCCGGC
UAACUACGUGCCAGCAGCCGCGGUAAUACGUAGGGGGCAAGCGUUGUCCCGGAAUUAUUGGGCGUAAAGCGCGCGCAGGC
GGCCUUGUAAGUCUGUCGUUUAAACUCGGAGCUCAACUUCGAGUCGCGAUGGAAACUGCAAAGCUUGAGUGCAGAAGAGG
AAAGUGGAAUUCCACGUGUAGCGGUGAAAUGCGUAGAGAUGUGGAGGAACACCAGUGGCGAAGGCGACUUUCUGGGCUGU
AACUGACGCUGAGGCGCGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCCGUAAACGAUGAAUGCU
AGGUGUUAGGGGUUUCGAUACCCUUGGUGCCGAAGUUAACACAUUAAGCAUUCCGCCUGGGGAGUACGGUCGCAAGACUG
AAACUCAAAGGAAUUGACGGGGACCCGCACAAGCAGUGGAGUAUGUGGUUUAAUUCGAAGCAACGCGAAGAACCUUACCA
GGUCUUGACAUCCCUCUGACCGUCCUAGAGAUAGGGCUUUCCUUCGGGACAGAGGAGACAGGUGGUGCAUGGUUGUCGUC
AGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGCAACGAGCGCAACCCUUGAUCUUAGUUGCCAGCACUUUGGGUGGGCA
CUCUAGGAUGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAAAUCAUCAUGCCCCUUAUGACCUGGGCUAC
ACACGUACUACAAUGGCCGAUACAACGGGAAGCGAAACCGCGAGGUGGAGCCAAUCCUAUCAAAGUCGGUCUCAGUUCGG
AUUGCAGGCUGCAACUCGCCUGCAUGAAGUCGGAAUUGCUAGUAAUCGCGGAUCAGCAUGCCGCGGUGAAUACGUUCCCG
GGUCUUGUACACACCGCCCGUCACACCACGAGAGUUUACAACACCCGAAGCCGGUGGGGUAACCGCAAGGAGCCAGCCGU
CGAAGGUGGGGUAGAUGAUUGG
ADD REPLY
1
Entering edit mode

No it doesn't, because it lacks the semicolon ( ; ) at the end of the species / strain name in the fasta header. I've used the semicolon because all the fasta headers you provided had a semicolon, and it was a good way to, e.g., select "Paenibacillus borealis" (which is included in your list) and avoid "Paenibacillus borealisensis" (which I made up and is not included in your list). That is, it was just a way of avoiding longer names which could contain shorter ones, but were not in the list.

Anyways, if you change:

push(@headers, $_.";");

to:

push(@headers, $_);

it should work again.

ADD REPLY
4
Entering edit mode
9.6 years ago
Juke34 8.9k

BASH

IFS=$'\n'
for i in $(cat headersFILE)
do
  line=$(grep -nr "$i" multiFastaFILE)
  if [[ ! -z $line ]]
  then
    for j in $line
    do
      lineNumber=$(echo $j | cut -d':' -f1)
      sed -n "$lineNumber p" multiFastaFILE
      awk -v nb=$lineNumber 'NR > nb {if ($0 ~ ">") exit; else print $0 }' multiFastaFILE
    done
  fi
done

Where headersFILE is the file containing one header per line ; multiFastaFILE is obviously your multifasta file.

Every time a fasta sequence header of your multiFastaFILE contains a name described in your headersFILE, the header and its sequence will be displayed.

ADD COMMENT
0
Entering edit mode

It worked very well! Thanks!

ADD REPLY
3
Entering edit mode
9.6 years ago

You could use Biopython's fasta parser. I didn't test this, but this would be the basic idea:

from Bio import SeqIO
handle = open("example.fasta", "rU")
organism_name = "Paenibacilus borealis"
for record in SeqIO.parse(handle, "fasta"):
    if organism_name in record.id:
        # Do whatever you want with these
        print record.id
        print record.seq
handle.close()

There's more info here: http://biopython.org/wiki/SeqIO

ADD COMMENT
0
Entering edit mode

I have a similar problem, I have many species name in one fasta file and I want them to copy those files which contain the desired species sequence.

The file is like this

>CanaanFir115301    PQPIVPPAASTSSEATTSTDVPQPIVHPAASTSSEATTSPVAFPARDNVAGTSTTSTDVPQPIVPPAASTSSEATTSPVA
FPARDNVAGTSSSSSVSKASPLIHDIVAAARAVVTDADLDDLIGNIFSLLEHIKAKKTETKSS
>FraserFir31480
MGKTTLADAVFARVYVEGCKYSMVRLFDDVTSTPSTSHIVKLQKYILEDLKMGTPEETTHE
>CanaanFir58880
SSTDVLPESNEIAPVDHFSVTSTAAPNATDHGPQPIVPPDASSSSTDVLPESNEIAPVDHFSVTSTAAPNATDDALKHPQ
HGPQPILPPDASSSSTDVLPDSNEPQPIVHPDLSSSSPLAFPAHDEVYQVSEAASSSNAATDNHKVADDQLQGTPSSSTV
SNAPHLINNILTAANALDT
>BalsamFir4479
MEKKSCRCRLFSCFQCCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSTDVPQPIVHPAASASSKFSCFQ
CCFSAQRPDPSVTSTAATDDALHHRPQPIVPSAASTSSEPTTSADVPQPIVHPAASTSSEATT

I want to copy the files which contain the sequence (Balsam, Fraser and Canaan) species. The rest will be discarded.

How can I do that?

ADD REPLY
2
Entering edit mode
9.6 years ago
venu 7.1k

As your list is containing 'Paenibaci' in common, I have followed this approach

grep '^>' input.fasta > headers.txt
sed -n '/Paenibaci/p' headers.txt > filter_headers.txt
$ ./faSomeRecords input.faa filter_headers.txt output.faa

perl code to extract sequences from multi-line fasta works on all test files but not on research files

ADD COMMENT
0
Entering edit mode

Not worked. Output file is empty.

ADD REPLY
0
Entering edit mode

Can you provide some more organism names from the organisms file? You have only provided one name, post 5-10 line from organism file

ADD REPLY
0
Entering edit mode

Edited.

ADD REPLY
0
Entering edit mode

try the edited one..it might work

ADD REPLY
0
Entering edit mode

I think you have not understood. I already have a "header" file (organism name), which was not derived from a fasta file (so, grep and sed commands are unnecessary). I want to collect those sequences that contain specif organism names ("headers") from a FASTA file.

ADD REPLY
2
Entering edit mode
9.6 years ago
pip install --user pyfaidx 
xargs faidx 16s.fasta < organisms.txt
ADD COMMENT
0
Entering edit mode

I wasn't familiar with that package and it looks really nice. Thanks!

ADD REPLY
0
Entering edit mode

Firstly, the pip install did not work. I had to install using setup.py.

When I try to run the script, I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 9, in <module>
    load_entry_point('pyfaidx==0.4.0', 'console_scripts', 'faidx')()
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 149, in main
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 40, in write_sequence
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 50, in fetch_sequence
KeyError: 'Paenibacillus_alginolyticus_DSM_5050'

My command was:

xargs faidx rdp_download_2882seqs.fasta < organisms.txt

where organisms.txt is the list of organism names.

ADD REPLY
0
Entering edit mode

Regarding the pip install issue, you probably don't have pip installed, but it seems that you figured that out. The KeyError is telling you that the organism name Paenibacillus_alginolyticus_DSM_5050 does not exist as a FASTA header line in your file. I should probably catch this and warn the user instead of having the script bail.

ADD REPLY
0
Entering edit mode

Yes, I had pip installed, and it installed pyfaidx. However faidx was not executable it in terminal, only in python.

In my list of species, each line has a organism name. Paenibacillus_alginolyticus_DSM_5050 was the first one. So, if the first name is not located in my fasta file, faidx does not try to locate other names?

ADD REPLY
0
Entering edit mode

That is correct, but I just submitted a change that fixes that behavior. The current github master branch will just warn the user and move to the next query without exiting with an error code.

ADD REPLY
0
Entering edit mode

I've just pushed a new version of pyfaidx (v0.4.1) that will produce a warning for each sequence that is not found in your file, but will continue without exiting. You can redirect stderr to a log file to figure out which organisms are missing:

xargs faidx rdp_download_2882seqs.fasta < organisms.txt 2> missing.log
ADD REPLY
0
Entering edit mode

Not worked for me. Here is the log:

..
warning: Paenibacillus not found in file
warning: sp. not found in file
warning: 1-49 not found in file
..

Of course the strain Paenibacillus sp. 1-49 is in rdp_download_2882seqs.fasta and organisms.txt files.

ADD REPLY
0
Entering edit mode

Ah, yes. There can be no spaces in the sequence names. "xargs" passes each line in organisms.txt as the arguments for "faidx". If there are spaces, they are interpreted as separate sequence names. Generally it is advisable to not include spaces in the FASTA sequence names, as most programs will truncate anything after the first space.

ADD REPLY
0
Entering edit mode

Still not working. Here is the log:

warning: Paenibacillus_sp._1-49 not found in file

In my rdp_download_2882seqs.fasta I have the following sequence:

>ASRY01000111.1.1112_Paenibacillus_sp._1-49
GAGUAACACGUAGGCAACCUGCCCAUGAGACUGGGAUAACUACCGGAAACGGUAGCUAAUACCGGAUACAUCCUUUCCCU
GCAUGGGGAGGGGAGGAAAGACGGAGCAAUCUGUCACUGAUGGAUGGGCCUGCGGCGCAUUAGCUAGUUGGUGGGGUAAU
ADD REPLY
0
Entering edit mode

The issue here is that you are expecting to find an inexact match. You can do this using:

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log

The --regex option specifies that the sequence names are regular expressions patterns to be matched.

ADD REPLY
0
Entering edit mode
Traceback (most recent call last):
  File "/usr/local/bin/faidx", line 9, in <module>
    load_entry_point('pyfaidx==0.4.1', 'console_scripts', 'faidx')()
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 153, in main
  File "build/bdist.linux-x86_64/egg/pyfaidx/cli.py", line 16, in write_sequence
  File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 577, in __init__
  File "build/bdist.linux-x86_64/egg/pyfaidx/__init__.py", line 214, in __init__
IOError: [Errno 2] No such file or directory: 'Paenibacillus_alginolyticus_DSM_5050'
ADD REPLY
0
Entering edit mode

See the updated command above. Note that order of arguments matters.

ADD REPLY
0
Entering edit mode
faidx: error: unrecognized arguments: Paenibacillus_sp._1-49

I also have tried with simpler data, but no output has been generated (I also have utilized the -o option).

ADD REPLY
0
Entering edit mode

For future online interactions, it is usually preferred to tell people what you did (what commands you tried), the error, and show some of the data you're working with. At this point I see no reason why you should not have a solution that works.

ADD REPLY
0
Entering edit mode

Firstly, thanks. I had two solutions, as you can see. I just tried to give you a feedback, since you expend some time helping me.

I have tested your script with the data I provided above (the only difference is the presence of underscores instead of spaces). The command is exactly what you send me (the file names are the same):

xargs faidx rdp_download_2882seqs.fasta --regex  < organisms.txt 2> missing.log > filtered.out

Here is the error message:

faidx: error: unrecognized arguments: Paenibacillus_sp._1-18 Paenibacillus_sp._1-49 Paenibacillus_sp._A9 Paenibacillus_sp._Aloe-11

Certainly your script recognized "Paenibacilus_borealis), but the output was empty.

I tested the same data with fasgrep, and I had a positive result.

ADD REPLY

Login before adding your answer.

Traffic: 2074 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6