How To Remove The Identical Sequences In The Fasta Files?
3
1
Entering edit mode
10.2 years ago

Some FASTA files have sequences with different IDs that nonetheless have the same sequence. I want to remove duplicate sequences based on the nucleotide sequence, rather than ID.HOW to do?

Additionally, I want to counts Ns as a match. For example,

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

collapse identical sequences into a single sequence.

>seq1_seq2
AATGC
>seq3 
AGATC

HOW to do?

sequence • 5.3k views
ADD COMMENT
0
Entering edit mode

Thank you for answer.

ADD REPLY
2
Entering edit mode
10.2 years ago
Ido Tamir 5.2k
ADD COMMENT
0
Entering edit mode

Thank you for your answer. I will try it.

ADD REPLY
1
Entering edit mode
10.2 years ago

You might work with the following as a start:

​#!/usr/bin/perl

use strict;
use warnings;

my $header = "";
my $seq = "";
my $fastaRef;

# read FASTA in from standard input
while(<>) {
    chomp;
    if (/>/) {
        if (length $seq > 0) {
            push(@{$fastaRef->{$seq}}, $header);
        }
        $header = substr $_, 1;
    }
    else {
        $seq .= $_;
    }
}
push (@{$fastaRef->{$seq}}, $header);

# print out merged-header sequences
foreach my $seqKey (keys %{$fastaRef}) {
    my @headers = @{$fastaRef->{$seqKey}};
    print STDOUT ">".join("_", @headers)."\n".$seqKey."\n";
}

One thought is to pre-process the FASTA input to "explode" each N-sequence to four (or power-of-four) ACTG-sequences. Each exploded sequence is given a modified header name. When merging, you can filter out any unmerged N-sequences based on the modified header name; the matched sequence will merge its header with another matched sequence header.

>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC

This could become something like:

>seq1
AATGC
>seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

When you merge this modified FASTA input with the above script, you get something like this:

>seq1_seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC

Your last step is to remove *-modN sequences that only have one match (to themselves) to get a final answer:

>seq1_seq2-modA
AATGC
>seq3
AGATC

This should be an easy modification of the script above, looking for any sequences with a one-element array and then a *-modN suffix.

ADD COMMENT
0
Entering edit mode

Thank you for your answer. That's a good idea. but some sequences have many N, so modified sequence increase more.

ADD REPLY
0
Entering edit mode

True, a sequence with n N will have 4^n modified sequences. Still, I think this approach should work reasonably well for small inputs.

ADD REPLY
0
Entering edit mode

I think so too. If there is an opportunity, I would like to use this approach.

ADD REPLY
0
Entering edit mode
10.2 years ago
jgbradley1 ▴ 110

If you're just looking to do a very basic implementation for a short file, then I would go with the naive approach. It would be faster to implement, although slower. For each sequence, compare it to every other sequence in the file 1 by 1. If you consider the Ns a match, then just don't check for a match when you encounter a N as you iterate through the two sequences. This is basically a pairwise approach. If you're looking for a more efficient solution however, use suffix trees.

ADD COMMENT
0
Entering edit mode

Thank you for your answer.

ADD REPLY

Login before adding your answer.

Traffic: 3073 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6