How To Filter Multi Fasta By Length??
4
7
Entering edit mode
11.4 years ago
HG ★ 1.2k

I save this bit of code in a file called removesmalls, made it executable chmod +x

#!/bin/awk -f
!/^>/ {next}
 {getline s}
length(s) >= i { print $0 “\n” s }

and then i used it with the following command

awk removesmalls i=200 contigs.fa > contigs-l200.fa

Result :Now its generating the out put file also but without any data

Please help me out.

awk • 43k views
ADD COMMENT
4
Entering edit mode

Hi, you can change the record separator RS into > and turn your script into a short one-liner:

awk 'BEGIN {RS = ">" ; ORS = ""} length($2) >= 200 {print ">"$0}' contigs.fa > contigs-l200.fa

If you want the minimum size to be an external variable, use the -v option:

awk -v min="200" 'BEGIN {RS = ">" ; ORS = ""} length($2) >= min {print ">"$0}' contigs.fa > contigs-l200.fa
ADD REPLY
0
Entering edit mode

Thank you so much

ADD REPLY
28
Entering edit mode
11.4 years ago

Your original awk script assumes that the FASTA sequence following the header line spans only one line, which is not normally the case. As suggested by @SES above, bioawk is definitely a neat alternative.

A traditional Perl based approach would be like so:

## removesmalls.pl
#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen);
    }
    local $/="\n";
}

You would then invoke the script like so:

perl removesmalls.pl 200 contigs.fasta > contigs-l200.fasta

This solution is making use of the Perl record separator (which is a "\n" newline by default) and switching it to ">" so that when iterating through the FASTA file, we encounter whole FASTA records. This addresses the issue I mentioned above.

ADD COMMENT
0
Entering edit mode

Thank you so much for your assessment ...

ADD REPLY
0
Entering edit mode

Hi this script is really helpful. I have run it with my fasta files (Assembled transcriptomes) removing contigs less than 300 and the output files do not have these contigs.

Is there a way to print an output file for contigs with a minimum and maximum lenght for example for 300 to 600 bp. I need to do this task to separate (filter) my fasta files by contig length: 1) to see how many contigs of a specific lenght are (>300, 300 to 600, 600 to 1200, 1200 to 1800,,, etc) 2)These filtered files i will blast against a Uniprot dataset to see how many of these hit off?

Thanks for your help

ADD REPLY
2
Entering edit mode

You can extend the script above to include a maxlen parameter to extract sequences within a range of lengths, like so:

# filterfastarange.pl
#!/usr/bin/perl
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
my $maxlen = shift or die "Error: `maxlen` parameter not provided\n";
{
    local $/=">";
    while(<>) {
        chomp;
        next unless /\w/;
        s/>$//gs;
        my @chunk = split /\n/;
        my $header = shift @chunk;
        my $seqlen = length join "", @chunk;
        print ">$_" if($seqlen >= $minlen and $seqlen <= maxlen);
    }
    local $/="\n";
}

You would the invoke this script like so:

perl filterfastarange.pl 301 600 contigs.fasta > contigs-gt300-lte600.fasta

Please note, the if() condition above assumes a closed interval [minlen, maxlen], i.e. sequence length is greater than or equal to minlen and less than or equal to maxlen. So your filter intervals should be defined appropriately: [301, 600], [601, 1200], [1201, 1800] and so on, so as to not cause any overlaps between the consecutive sequence sets.

You could easily extend this script to loop through your desired ranges.

Hope this helps!

ADD REPLY
0
Entering edit mode

I've been trying to figure out a way to do this for a few days now, this is super helpful, thank you!

ADD REPLY
0
Entering edit mode

The above script works for me but there is a problem, i have 109 different fasta files with different file names. i can run the script for individual file but i want to run the script at once for all files and the result file should be individually different for each.

file names are like SRR8224532.fasta, SRR8224533.fasta, SRR8224534.fasta, and so on i want the result files after removing the contigs (i.e., for me less than 1000) something like SRR8224532-out.fasta, SRR8224533-out.fasta, and so on.

Any help or suggestion would be helpfull.

ADD REPLY
0
Entering edit mode

@subrahmanya, you should be able to achieve this via the following loop:

for fasta_file in $(find . -name "SRR*.fasta" -type f); do
    output_fasta_file="$(basename $fasta_file .fasta)-out.fasta"
    perl removesmalls.pl 1000 $fasta_file > $output_fasta_file
done

In the above code, the find command assumes that all your FASTA files are located in the current directory, as signified by the dot (.)

ADD REPLY
11
Entering edit mode
11.4 years ago
SES 8.6k

I recommend you try bioawk, which is a modified version of awk that will parse some common sequence formats. In my opinion, there is no reason to mess with trying to parse files if someone has already figured it out and created some easy to use methods. Also, I would argue against writing a script for simple tasks when a single line of code will do the job. Here is a one-liner to get sequences over 50 bp (will take fasta or fastq):

bioawk -c fastx '{ if(length($seq) > 50) { print ">"$name; print $seq }}' some_seq.fastq
ADD COMMENT
0
Entering edit mode

bioawk i tried earlier : I found some problem to install it. Once i am doing like this

bioawk -c fastx '{ if(length($seq) > 1000) { print ">"$name; print $seq }}' list4a.fasta 
bioawk: command not found

Then i tried to install bioawk like this

make 
make: `awk' is up to date.

I could not solve it. Could you please check it once?

ADD REPLY
0
Entering edit mode

I see, you will need to compile bioawk first, then create a link to awk and name it bioawk. This is not strictly necessary, but I do this so bioawk does not conflict with the system awk (both are named 'awk'). After you type make to compile it, just create a link ln -s awk bioawk and try again. Your shell will not know it's there so you'll have to add it to your PATH, or specify the full path to bioawk.

ADD REPLY
0
Entering edit mode
ln -s awk bioawk
ln: failed to create symbolic link `bioawk': File exists
ADD REPLY
0
Entering edit mode

There is no file named 'bioawk' in the distribution, which indicates you typed that command more than once. What did you see when you tried to execute the named file that exists?

ADD REPLY
0
Entering edit mode

bioawk seems to be a very nice tool!

ADD REPLY
0
Entering edit mode

and using awk filtering:

bioawk -c fastx '(length($seq) > 50) { print ">"$name"\n"$seq }'
ADD REPLY
1
Entering edit mode
11.4 years ago
SRKR ▴ 180

Here is a PHP solution for your problem using Regular Expression. I have used a limit of length 30 here, you change it as per your wish. The sequences will be in fasta format in seqs.txt file and the resulting screened sequences will be written to screened_seqs.txt:

<?php
$readfile = "seqs.txt";
$writefile = "screened_seqs.txt";
$screenlen = 30;
$fr = fopen($readfile, "r");
$text = fread($fr, filesize($readfile));
$fw = fopen($writefile, "w");
$ntext = preg_replace("/\>[^\n]{1,}\n[ATGCatgc\n]{1,".($screenlen-1)."}\>/", ">", $text);
fwrite($fw, $ntext);
fclose($fr);
fclose($fw);
?>

Hope this helps you.

ADD COMMENT
0
Entering edit mode

Thanks a lot.....

ADD REPLY
1
Entering edit mode
11.4 years ago
cts ★ 1.7k

In the text you give you use typographers quotes (the curly double quotes), which may be stuffing things up also could it be that your missing the -f between awk and removesmalls. Copying your text and using the following command worked for me:

awk -f removesmalls i=200 test.fa >someoutput.fa

also you should be able to do:

./removesmalls i=200 test.fa >someoutput.fa

ADD COMMENT
0
Entering edit mode

awk -f removesmalls i=200 list4a.fasta > list4aout.fasta
i tried with this command but same result it is generating file without any data Here is the link from where i got the code http://bioinfofly.wordpress.com/2012/07/09/binawk-f-nex/ colud you please ckeck once in your system

ADD REPLY
0
Entering edit mode

Simply copying the code from your question or from the wordpress article produces an error for me. As I say in my answer above, the text in your question contains typographers quotes (also called smart quotes or curly quotes) around the \n in your print statement. You need to remove them and replace them with normal quotes. This can be simply done by deleting those two characters and replacing them with a standard ". With this modification and with the -f change the code you give works for me

ADD REPLY

Login before adding your answer.

Traffic: 1140 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6