Extract Fasta Sequences Using Wild Card Search
3
4
Entering edit mode
11.8 years ago
empyrean999 ▴ 180

I wanted to extract fasta sequences using the wild card search

for example

grep "oxidase" input.fasta

The above statement provides all headers which match with oxidase and i had to extract the sequences as well.

For now how i was approaching is that i am writing above statements in to file and using some awk statements to remove unwanted characteres but with this approach, it becoming tiresome to execute all these commands as format keep changing. And using the file, i use below statement to extract sequences

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file

Does anyone have direct solution for this??

fasta awk perl unix • 6.3k views
ADD COMMENT
0
Entering edit mode

What do you mean by "direct solution"? (and what formats keep changing?)

ADD REPLY
0
Entering edit mode

direct solution means that any script which takes wild card as input and provides output with sequences..

">testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein "

That was one of the header i have but the script works with fasta id only which is "testcomp2344c0_seq1 " . so i have to perform preprocessing to remove rest of it.

ADD REPLY
2
Entering edit mode
11.8 years ago
awitney ▴ 30

I would use BioPerl for this. Something like (untested):

use Bio::SeqIO;

my $infile  = $ARGV[0];
my $string  = $ARGV[1];

my $seqin  = Bio::SeqIO->new( -format => 'fasta', -file => $infile );
my $seqout = Bio::SeqIO->new( -format => 'fasta', -fh =>\*STDOUT );

while ( my $seq = $seqin->next_seq() ) {
    my $id     = $seq->id;
    my $description     = $seq->description;

    if ( $id =~ m/$string/i ||  $description =~ m/$string/i ) {
        $seqout->write_seq($seq);
    }
}
ADD COMMENT
3
Entering edit mode

Excellent use of Bio::SeqIO!

If I may make a few observations...

  • Always use strict; use warnings;.
  • This limits the OP to searching a single file at a time.
  • You can match directly on $seq->id and $seq->description.
  • It's best to quotemeta on $string.

Given the above, consider the following:

use strict;
use warnings;
use Bio::SeqIO;

my $string = pop;
my $seqout = Bio::SeqIO->new( -format => 'Fasta', -fh => \*STDOUT );

for my $inFile (@ARGV) {
    my $seqin = Bio::SeqIO->new( -format => 'Fasta', -file => $inFile );

    while ( my $seq = $seqin->next_seq() ) {

        if (   $seq->id =~ m/\Q$string\E/i
            or $seq->description =~ m/\Q$string\E/i )
        {
            $seqout->write_seq($seq);
        }
    }
}

Since the search term is always the last item sent to the script, you can pop it off @ARGV (implied if no argument is passed to pop). File names are left in @ARGV, so the above iterates through each. Depending upon the file system, one could perl script.pl * oxidase on a dir of fasta files--or just write out the separate files.

ADD REPLY
0
Entering edit mode

Yes I assumed he would add the use strict/warnings and the shebang line. I was just trying to give the OP an idea of how to do it, to be adjusted to his exact wishes. Thanks for the observations though.

ADD REPLY
2
Entering edit mode
11.8 years ago
Kenosis ★ 1.3k

Here's another option:

use strict;
use warnings;

my $term = join '|', map "\Q$_\E", split ' ', pop;
my $found;

while (<>) {
    if (/^>/) {
        $found = /$term/i ? 1 : 0;
        next;
    }
    print if $found;
}

Usage: perl script.pl file(s) 'searchTerm [searchTerm]' [>outFile]

Sample fasta:

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>testcomp2344c0seq1 BlastHit=ref|NP182170.1| inner membrane OXA1-like protein [Arabidopsis thaliana]sp|Q9SKD3.1|OXA1L_ARATH RecName: Full=Mitochondrial inner membrane protein OXA1-like; Flags: Precursorgb|AAD23047.1| putative cytochrome oxidase biogenesis protein
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK

Output searching on 'oxidase':

KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM

Entering space-delimited search terms acts as an OR search. Remove next; if you want to print the header.

ADD COMMENT
1
Entering edit mode

Cool script, works as described. BTW, add another:

print if $found;

right above 'next' to obtain entire FASTA records, including ids.

ADD REPLY
2
Entering edit mode
11.8 years ago

Biopieces (www.biopieces.org) are really good at this kind of exercise:

Simple pattern matching of the sequence name:

read_fasta -i in.fna | grab -p <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

Using Regex:

read_fasta -i in.fna | grab -r <my pattern> -k SEQ_NAME | write_fasta -o out.fna -x

grab does a bunch of clever things - check it out.

ADD COMMENT

Login before adding your answer.

Traffic: 1881 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6