Question

How to copy all fasta-seqs from fasta-files with the seq-lengths between minlen and maxlen

0

Entering edit mode

10.6 years ago

natasha.sernova ★ 4.0k

Dear all, happy New Year!

I need to cut all fasta-seqs with their headers (reading consequtively many fasta-files) with the seq-lengths between minlen and maxlen. For one file at the bottom the script seems to work.

I've seen this example somewhere in biostar, thanks to the author! I've just modified it a little bit - it was just for a single length limit. but I have about one hundred fasta files with different names, a sequence length range (>50 and <100) and I am very poor in awk - I cannot make this shorter. Could you, please, help me! Many thanks in advance!

#!/usr/bin/perl
use strict;
use warnings;

#to cut a fasta-seqs from fasta-files with the seq-lengths between minlen and maxlen

# the script: FRAGM_both_values_fasta_trial.pl
my $minlen = $ARGV[0];
my $maxlen = $ARGV[1];

{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if(($seqlen >= $minlen)&&($seqlen <= $maxlen));
    }
    local $/="\n";
}

#Then invoke the script like that:
# perl FRAGM_both_values_fasta_trial.pl 5 10 example_fasta.txt > example_fasta_5_10.txt
# I make the range narrower - from 5 to 10 for training.

example_fasta.txt
>one
ETGT
>TWO
DGJLKTFJG
>THREE
DHSFRUTYIPUTE
>FOUR
DGFJTI
>FIVE
ADKLPFGGHH

perl fasta • 6.1k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

Thank you very much, I will try.

Best wishes,

Natasha

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

10.6 years ago

YOT ▴ 30

You can try this code https://github.com/igoralves1/readFASTA_PHP

It is a easy php code.

Go to your localhost and create your directory, lets say "myfastarange".
Inside "myfastarange" create a directory lets say "fastadir" and put all your fasta files inside it.
Download index.php from the site above.
Inside index.php you will see a function like this separateFASTA::separete_n_FASTA_Seqs_InsideDir("./fastadir",250 ,380); that you will set your directory you have created the min size and the max size.
Run your index.php from the browser like this: localhost/myfastarange/index.php

The code will create inside your fastadir a new directory called result. Inside result you will see 3 files all sequences less that your min, another file with all sequences betweeen the max and the min and the last one with all sequences more than the max.

Note: this code works with any sequence. That means RNA or DNA or protein.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by YOT ▴ 30

0

Entering edit mode

Thank you very much, it's a nice idea to check.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-01-02

4

Entering edit mode

10.6 years ago

Brian Bushnell 20k

I've written a program that will do this efficiently:

reformat.sh in=reads.fasta out=filtered.fasta minlen=51 maxlen=99

It runs at hundreds of MB/s so will save time if you have large files, and accepts all common formats, not just fasta.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Brian Bushnell 20k

0

Entering edit mode

Dear Brian,

Thank you very much! But my antivirus has blocked the downloading, so I cannot look at your program and try it. I don't know the reason why my antivirus becomes crazy from time to time, sorry! I could do something wrong, please, give me the instructions how to download it without any problems. Many-many thanks!

Sincerely yours,

Natasha

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

I will email it to you, if you send your address to bbushnell at lbl ot gov (dot intentionally misspelled as "ot").

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by Brian Bushnell 20k

Ram · Accepted Answer · 2015-01-02

2

Entering edit mode

10.6 years ago

Ram 45k

This is Perl. Sorry to say this, but could you try mine so I can help you out? It is a bit cleaner than this script.

https://github.com/RamRS/myPerlScripts/blob/master/pickLenMtoN.pl

Also, it would be great if you could explain your question - what you're looking to achieve.

ADD COMMENT • link 3.4 years ago by Ram 45k

0

Entering edit mode

Thank you very much! Please, tell me what your name is - I must know whom to acknowledge.

Bioperl... With diagram... It looks better than I could imagine. Will it be possible to reach the separate protein headers? I need exactly this for the proteins within 50-100 range. Sorry, I am too tired to check it right now.

Sincerely yours,

Natasha

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

Hi, my name is Ram. I'm not sure what you mean by diagram. I have implemented it using BioPerl and argparse - should avoid clunky $ARGV[i] values and use recognizable argument names.

What do you mean, reach separate protein headers? Could you maybe elaborate on that a bit please?

ADD REPLY • link 3.4 years ago by Ram 45k

0

Entering edit mode

There are a number of bugs in this script. You are using a (non-core) module to parse options but then you are checking @ARGV, so the script exits no matter what options are given. If it did work, it appears that it will print a blank line before the first record. The command line arguments (-w) in a script are discouraged because that has far greater scope than you may realize. Also, it is faster and safer to simply use "=" to set lexical values rather than the loop/topic method used here, which is unnecessary.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by SES 8.6k

0

Entering edit mode

Thank you for the input. I'll incorporate these one by one, if you're willing to walk me through them. I'll now fix the mixed use of argparse and @ARGV.

Also, what do you mean by "command line arguments are discouraged"?

ADD REPLY • link 3.4 years ago by Ram 45k

1

Entering edit mode

For the last question, I was referring to the use of "-w" in the shebang line, which is a command line flag to the Perl interpreter. It says "turn on warnings everywhere, including all imported modules." This is different from putting use warnings; in a script, which says "turn on warnings in this block only." This is described in the perldoc for warnings. I used to do the first one because I saw it in every script, but the latter gives you what you want, warnings for your script only. The typical behavior is enabled by default in the latest Perls.

For the options, I would encourage the use of Getopt::Long since it has been part of core Perl forever, or maybe use Getopt::Long::Descriptive for a OO interface that does a lot of option processing, usage statement, and error handling for you. What I meant with the blank line is you have \n>$seqid and you want to put the newline after the ID and SEQ, which can be done with a join statement. For setting values, I was referring to the for loop with the use of the topic $_ which is unnecessarily complicated and it will not be as fast as my ($firstct, $secondct, $thirdct) = (0, 0, 0);. The speed thing should not really be a concern, but readability should and this form is easier to understand. Coincidentally, defining lexical variables in one line like this is optimized to be a bit faster than defining them separate, but I think that is only true in recent versions of Perl.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by SES 8.6k

0

Entering edit mode

Thank you for the detailed explanation. I changed the loop-topic and new line errors, and I'll look into the -w and Getopt ones now.

ADD REPLY • link 3.4 years ago by Ram 45k

Ram · Accepted Answer · 2015-01-03

Just for reference, you don't need to write a script for this type of task. If you are going to use bioperl, a one-liner will do the trick:

perl -MBio::SeqIO -E '$seqio = Bio::SeqIO->new(-fh => \*STDIN); while ($seq = $seqio->next_seq) { say join "\n", ">".$seq->id, $seq->seq if ($seq->length >= $ARGV[0] && $seq->length <= $ARGV[1]); }' 10 20 < seqs.fas

If you aren't sure what that does it is better to put the code in a script and enable warnings so you can easily find issues.

Ram · Accepted Answer · 2015-01-04

2

Entering edit mode

10.6 years ago

lh3 33k

To complete your example:

#!/bin/env perl
use strict;
use warnings;

my $min = shift(@ARGV);
my $max = shift(@ARGV);

$/ = ">"; <>; $/ = "\n";
while (<>) { # read name
    my $name = $_;
    $/ = ">"; $_ = <>; # read sequence
    chomp; # trim ">"
    tr/\n//d; # delete "\n"
    if (length($_) >= $min && length($_) <= $max) {
        print ">$name";
        print "$_\n";
    }
    $/ = "\n";
}

This is not as robust as bioperl, but is ~20X faster on my input file.

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by lh3 33k

0

Entering edit mode

Thank you very much! It is much easier to understand!

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.6 years ago by natasha.sernova ★ 4.0k