Question

Extract Fasta Sequences Using Wild Card Search

4

Entering edit mode

12.2 years ago

empyrean999 ▴ 180

I wanted to extract fasta sequences using the wild card search

for example

grep "oxidase" input.fasta

The above statement provides all headers which match with oxidase and i had to extract the sequences as well.

For now how i was approaching is that i am writing above statements in to file and using some awk statements to remove unwanted characteres but with this approach, it becoming tiresome to execute all these commands as format keep changing. And using the file, i use below statement to extract sequences

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file

Does anyone have direct solution for this??

fasta awk perl unix • 6.6k views

ADD COMMENT • link updated 12.2 years ago by Martin A Hansen 3.0k • written 12.2 years ago by empyrean999 ▴ 180

0

Entering edit mode

What do you mean by "direct solution"? (and what formats keep changing?)

ADD REPLY • link 12.2 years ago by fo3c ▴ 450

0

Entering edit mode

direct solution means that any script which takes wild card as input and provides output with sequences..

That was one of the header i have but the script works with fasta id only which is "testcomp2344c0_seq1 " . so i have to perform preprocessing to remove rest of it.

ADD REPLY • link 12.2 years ago by empyrean999 ▴ 180

score 2 · Answer 1 · 2013-02-19

2

Entering edit mode

12.2 years ago

awitney ▴ 30

I would use BioPerl for this. Something like (untested):

use Bio::SeqIO;

my $infile  = $ARGV[0];
my $string  = $ARGV[1];

my $seqin  = Bio::SeqIO->new( -format => 'fasta', -file => $infile );
my $seqout = Bio::SeqIO->new( -format => 'fasta', -fh =>\*STDOUT );

while ( my $seq = $seqin->next_seq() ) {
    my $id     = $seq->id;
    my $description     = $seq->description;

    if ( $id =~ m/$string/i ||  $description =~ m/$string/i ) {
        $seqout->write_seq($seq);
    }
}

ADD COMMENT • link 12.2 years ago by awitney ▴ 30

3

Entering edit mode

Excellent use of Bio::SeqIO!

If I may make a few observations...

Always use strict; use warnings;.
This limits the OP to searching a single file at a time.
You can match directly on $seq->id and $seq->description.
It's best to quotemeta on $string.

Given the above, consider the following:

use strict;
use warnings;
use Bio::SeqIO;

my $string = pop;
my $seqout = Bio::SeqIO->new( -format => 'Fasta', -fh => \*STDOUT );

for my $inFile (@ARGV) {
    my $seqin = Bio::SeqIO->new( -format => 'Fasta', -file => $inFile );

    while ( my $seq = $seqin->next_seq() ) {

        if (   $seq->id =~ m/\Q$string\E/i
            or $seq->description =~ m/\Q$string\E/i )
        {
            $seqout->write_seq($seq);
        }
    }
}

Since the search term is always the last item sent to the script, you can pop it off @ARGV (implied if no argument is passed to pop). File names are left in @ARGV, so the above iterates through each. Depending upon the file system, one could perl script.pl * oxidase on a dir of fasta files--or just write out the separate files.

ADD REPLY • link 12.2 years ago by Kenosis ★ 1.3k

0

Entering edit mode

Yes I assumed he would add the use strict/warnings and the shebang line. I was just trying to give the OP an idea of how to do it, to be adjusted to his exact wishes. Thanks for the observations though.

ADD REPLY • link 12.2 years ago by awitney ▴ 30

score 2 · Answer 2 · 2013-02-19

Here's another option:

use strict;
use warnings;

my $term = join '|', map "\Q$_\E", split ' ', pop;
my $found;

while (<>) {
    if (/^>/) {
        $found = /$term/i ? 1 : 0;
        next;
    }
    print if $found;
}

Usage: perl script.pl file(s) 'searchTerm [searchTerm]' [>outFile]

Sample fasta:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK

Output searching on 'oxidase':

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

Entering space-delimited search terms acts as an OR search. Remove next; if you want to print the header.

score 2 · Answer 3 · 2013-02-19

Biopieces (www.biopieces.org) are really good at this kind of exercise:

Simple pattern matching of the sequence name:

read_fasta -i in.fna | grab -p <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

Using Regex:

read_fasta -i in.fna | grab -r <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

grab does a bunch of clever things - check it out.