Question

How To Remove The Identical Sequences In The Fasta Files?

1

Entering edit mode

10.2 years ago

izumi.kohsuke.10 ▴ 10

Some FASTA files have sequences with different IDs that nonetheless have the same sequence. I want to remove duplicate sequences based on the nucleotide sequence, rather than ID.HOW to do?

Additionally, I want to counts Ns as a match. For example,

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

collapse identical sequences into a single sequence.

>seq1_seq2
AATGC
>seq3 
AGATC

HOW to do?

sequence • 5.3k views

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by izumi.kohsuke.10 ▴ 10

0

Entering edit mode

Thank you for answer.

ADD REPLY • link 10.2 years ago by izumi.kohsuke.10 ▴ 10

Ram · Answer 1 · 2014-09-24

2

Entering edit mode

10.2 years ago

Ido Tamir 5.2k

fastx collapse: http://hannonlab.cshl.edu/fastx_toolkit/ for identical reads
http://www.drive5.com/usearch/ for clustering with some identity

ADD COMMENT • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Ido Tamir 5.2k

0

Entering edit mode

Thank you for your answer. I will try it.

ADD REPLY • link 10.2 years ago by izumi.kohsuke.10 ▴ 10

Ram · Answer 2 · 2014-09-24

You might work with the following as a start:

#!/usr/bin/perl

use strict;
use warnings;

my $header = "";
my $seq = "";
my $fastaRef;

# read FASTA in from standard input
while(<>) {
    chomp;
    if (/>/) {
        if (length $seq > 0) {
            push(@{$fastaRef->{$seq}}, $header);
        }
        $header = substr $_, 1;
    }
    else {
        $seq .= $_;
    }
}
push (@{$fastaRef->{$seq}}, $header);

# print out merged-header sequences
foreach my $seqKey (keys %{$fastaRef}) {
    my @headers = @{$fastaRef->{$seqKey}};
    print STDOUT ">".join("_", @headers)."\n".$seqKey."\n";
}

One thought is to pre-process the FASTA input to "explode" each N-sequence to four (or power-of-four) ACTG-sequences. Each exploded sequence is given a modified header name. When merging, you can filter out any unmerged N-sequences based on the modified header name; the matched sequence will merge its header with another matched sequence header.

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

This could become something like:

>seq1
AATGC
>seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

When you merge this modified FASTA input with the above script, you get something like this:

>seq1_seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

Your last step is to remove *-modN sequences that only have one match (to themselves) to get a final answer:

>seq1_seq2-modA
AATGC
>seq3
AGATC

This should be an easy modification of the script above, looking for any sequences with a one-element array and then a *-modN suffix.

score 0 · Answer 3 · 2014-09-24

0

Entering edit mode

10.2 years ago

jgbradley1 ▴ 110

If you're just looking to do a very basic implementation for a short file, then I would go with the naive approach. It would be faster to implement, although slower. For each sequence, compare it to every other sequence in the file 1 by 1. If you consider the Ns a match, then just don't check for a match when you encounter a N as you iterate through the two sequences. This is basically a pairwise approach. If you're looking for a more efficient solution however, use suffix trees.