Question

Remove Redundant Sequences Fasta

1

Entering edit mode

11.2 years ago

empyrean999 ▴ 180

Hi.. i have big fasta file with redundant sequences like

Seq_1 this_is_the_description_of_seq1 ATGACCAAGAGATAGATAACG

> Seq_1 
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_2 
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

The goal is to remove redundand sequence and keep the one which has description. the desired output for above example is

> Seq_1 this_is_the_description_of_seq1
ATGACCAAGAGATAGATAACG

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

Thanks for your help !

unix awk perl sequence fasta • 9.7k views

ADD COMMENT • link updated 7.8 years ago by Eslam Samir ▴ 110 • written 11.2 years ago by empyrean999 ▴ 180

1

Entering edit mode

Are they always ordered like in your example and/or do they always have the same name (but a possibly different description)?

ADD REPLY • link 11.2 years ago by Devon Ryan 105k

score 4 · Answer 1 · 2013-10-25

4

Entering edit mode

11.2 years ago

Pierre Lindenbaum 164k

remove spaces after '>'
linearize fasta: print whole line,first token, the length of the description, the sequence
sort on 1st token length of description (inverse)
sort+unique on first token using a stable sort to keep the longest descriptions at the top
keep the columns 1 (header) and 4 (sequence)
tr back to fasta

     $ sed 's/^>[  \t]*/>/' < input.fa | \ 
     awk -F ' ' '/^>/ { printf("\n%s\t%s\t%d\t",$0,$1,length($0));next;} { printf("%s",$0);} END { printf("\n");}' | \ 
     sort -t '  ' -k2,2 -k3,3nr  | \
     sort -t '    ' -k2,2 --stable -u | \
     cut -d '  ' -f 1,4 | tr "\t" "\n"

     >Seq_1 this_is_the_description_of_seq1
     ATGACCAAGAGATAGATAACG
     >Seq_2 this_is_the_description_of_seq2
     ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

ADD COMMENT • link 11.2 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

Hi Pierre! There is a pure Awk solution (split the fasta records with RS, store the sequences and the descriptions in arrays, store the description if it exists):

awk 'BEGIN {RS = "> "} NR > 1 {if ($3) {desc[$1] = $2 ; seqs[$1] = $3} else seqs[$1] = $2} END {for (seq in seqs) print ">"seq, desc[seq]"\n"seqs[seq]}' input.fa

ADD REPLY • link 11.2 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

It won't scale ;-)

ADD REPLY • link 11.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Why do you think it won't scale? The memory footprint should be roughly equivalent to the size of the fasta file.

ADD REPLY • link 11.2 years ago by Frédéric Mahé ★ 3.2k

1

Entering edit mode

exactly: if your fasta is far too big, you cannot load it in memory :-)

ADD REPLY • link 11.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Oh, you are right :-) If the file is larger than the available memory, the solution would be to read the file over and over to find paired records. Slow, but memory friendly.

ADD REPLY • link 11.2 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Very helpful !! Thank you Pierre:)

ADD REPLY • link 11.2 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Quick and easy version.. thank you pierre

ADD REPLY • link 11.2 years ago by empyrean999 ▴ 180

score 1 · Answer 2 · 2013-10-25

1

Entering edit mode

11.2 years ago

PoGibas 5.1k

This is a quick answer, there might be faster and better ways of doing this.

 grep -v '>\|^$'  Original_file | sort -u > Sequences    # Extract only nucleotide sequences (non-redundant) 
 while read SEQUENCE; do
     grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output
 done < Sequences

ADD COMMENT • link 11.2 years ago by PoGibas 5.1k

1

Entering edit mode

I think you're missing a '>' in your while loop. It should read "grep -w -m 1 -B1 $SEQUENCE Original_file >> Wanted_output", no?

ADD REPLY • link 11.2 years ago by Mitch Bekritsky ★ 1.3k

0

Entering edit mode

Thanks! Fixed it.

ADD REPLY • link 11.2 years ago by PoGibas 5.1k

score 1 · Answer 3 · 2013-10-25

Another way to do it is to make use of the fact that keys are unique in perl hashes (or Python dictionaries, or C++ maps, etc.) to eliminate duplicate sequences. Here's an implementation in perl:

#!/usr/bin/perl

use strict;
use warnings;

use Getopt::Long;

my $fastaFile;

GetOptions(
    "fasta=s"    => \$fastaFile,
);

open FA, '<', $fastaFile or die "Could not open $fastaFile: $!";

my $fa_hash;

my $seq;
my $hdr;

ENTRY:while(my $line = <FA>)
{
    next ENTRY if $line =~ m/^$/; #skip empty lines
    #The first line that gets read in should be a header line
    #otherwise, we can write a check to make sure it is
    chomp $line;
    $hdr = $line if $line =~ m/^>/;
    $line = <FA>;
    chomp $line;
    $seq = $line;

    #if the sequence is already in the hash, see if the current header is longer (which I'm using as a proxy for having a description)
    if(defined $fa_hash->{$seq})
    {
        $fa_hash->{$seq} = $hdr if length($hdr) > length($fa_hash->{$seq});
    }
    else
    {
        $fa_hash->{$seq} = $hdr;
    }
}

foreach my $key (keys %{$fa_hash})
{
    print $fa_hash->{$key},"\n";
    print $key,"\n\n";
}

The output would be

> Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

> Seq_1 
ATGACCAAGAGATAGATAACG

Assuming Seq_1 had no description. Although I have to say, Pierre's bash solution is really nice too...

score 0 · Answer 4 · 2013-10-25

Another option is to use Bio::SeqIO and a hash which pairs the id with the larger description--used as a filter. This method makes only two passes through the fasta file (no sorting):

use strict;
use warnings;
use Bio::SeqIO;

my ( $file, %hash, %seen ) = shift;

for my $i ( 0 .. 1 ) {
    my $in = Bio::SeqIO->new( -file   => $file, -format => 'Fasta' );

    while ( my $seq = $in->next_seq() ) {
        if ( !$i ) {
            $hash{ $seq->id } = $seq->desc if !defined $hash{ $seq->id } or length $seq->desc > length $hash{ $seq->id };
        }
        else {
            print '>'. $seq->id . ' ' . $hash{ $seq->id } . "\n" . $seq->seq . "\n" if !$seen{ $seq->id }++;
        }
    }
}

Usage: perl script.pl inFile [>outFile]

The last, optional parameter directs output to a file.

Output on your dataset:

>Seq_1 this_is_the_description_of_seq1 
ATGACCAAGAGATAGATAACG
>Seq_2 this_is_the_description_of_seq2
ATATTTTTGTAGTTTGACAATAAAATAATTAAAAATGTAAAAAATAAAAATCCCAAAATA

If you don't have access to the Bio::SeqIO module, here's a solution that produces the same output:

use strict;
use warnings;

my ( $file, %hash, %seen ) = shift;
local $/ = '>';

for my $i ( 0 .. 1 ) {
    push @ARGV, $file;
    while (<>) {
        chomp;
        next unless my ( $id, $desc, $seq ) = $_ =~ /(\S+)\s+([^\n]+)\s+(.+)/s;

        if ( !$i ) {
            $hash{$id} = $desc if !defined $hash{$id} or length $desc > length $hash{$id};
        }
        else {
            print ">$id $hash{$id}\n$seq" if !$seen{$id}++;
        }
    }
}

Although you show sequences only on one line, both of the scripts above handle multi-line sequences--just in case your actual dataset contains such.

Hope this helps!

score 0 · Answer 5 · 2015-06-02

0

Entering edit mode

9.6 years ago

pengchy ▴ 450

cat seq.fa |perl -ne 'chomp;s/>\s+/>/;if(/>(\S+)/){$id{$1}++;$id2=$1;}if($id{$id2}==1){print "$_\n"}' > seq_nr.fa

ADD COMMENT • link 9.6 years ago by pengchy ▴ 450

score 0 · Answer 6 · 2017-02-18

Here is my free program on Github Sequence database curator (https://github.com/Eslam-Samir-Ragab/Sequence-database-curator)

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It can work under Operating systems:

Windows
Mac
Linux

It also works for:

Fasta format
Fastq format

Best Regards