Question

How To Remove The Same Sequences In The Fasta Files?

45

14.1 years ago

Zhangleisdau ▴ 340

Some FASTA files (e.g., ESTs) contain sequences that have different IDs but are nonetheless identical in sequence. I want to remove duplicates based on the nucleotide sequence rather than the ID. How can I do this? Thank you all!

fasta sequence duplicates • 76k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 14.1 years ago by Zhangleisdau ▴ 340

0

What do you mean by "the same"? Are you including approximate string matching?

ADD REPLY • link 12.8 years ago by Fabian Bull ★ 1.3k

0

In fact, you might not want to remove all the duplicated sequences but rather collapse them into a single sequence. I guess that is what you meant, but your question does not state it clearly.

ADD REPLY • link 12.8 years ago by Michael 55k

Ram · Answer 1 · 2010-10-21

49

14.1 years ago

brentp 24k

You can do this with the FASTX-Toolkit.

Usage:

fastx_collapser < some.fasta > some.unique.fasta

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.1 years ago by brentp 24k

1

always love toolkits

ADD REPLY • link 14.1 years ago by Will 4.6k

1

bummer, apparently only works for nucleotide sequences. Where's the love for protein fasta files?

ADD REPLY • link 12.3 years ago by Andrew Su 4.9k

0

In my example (nucleotides), it removed the duplicates correctly but renamed the sequences...

ADD REPLY • link 7.4 years ago by Ludo Cottret • 0

0

Try Dedupe; it won't change the headers.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

Ram · Answer 2 · 2010-10-21

21

14.1 years ago

Pierre Lindenbaum 164k

Linearize the sequences: put '#' at the start of each FASTA header and '@' at its end, remove the newlines, then replace '#' with a newline and '@' with a tab.

Sort this tab-delimited file on the second column (the sequence) with the case-insensitive option, keeping only one line per unique sequence.

Restore the FASTA header and sequence.

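# Requires bash (for $'\t') and GNU sed; assumes '#' and '@' do not occur in the headers.
sed -e '/^>/s/$/@/' -e 's/^>/#/' file.fasta  |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
grep -v '^$' |\
sort -u -t $'\t' -f -k 2,2  |\
sed -e 's/^/>/' -e 's/\t/\n/'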

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.1 years ago by Pierre Lindenbaum 164k

0

I think this is the only answer that actually works if your fasta file contains a multiple sequence alignment.

Both fastx_collapser and gt sequniq fail because they consider '-' to be an invalid character in a FASTA sequence. I didn't ask those tools to validate the sequences, and that's not their job, but they did it anyway, making them useless for their actual purpose. Have these authors never heard of the Robustness Principle?

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.3 years ago by Chris Warth ▴ 110

0

I know this is old stuff, but I would like to thank you for this solution. Its logic is perfect for a pipeline I'm building.

ADD REPLY • link 4.3 years ago by Frederico Guimarães • 0

0

How will you store the duplicates?

ADD REPLY • link 3.3 years ago by Edwin • 0

Ram · Answer 3 · 2010-10-21

There are quite a lot of utilities that will do this; usually they are found as part of larger software packages (often aimed at motif discovery). For example, RSA-tools contains the utility purge-sequence, MEME contains purge. So those are some terms for your web search.

A roll-your-own solution is quite easy if you store the sequences as hash keys (which have to be unique). For example, using BioPerl's SeqIO you could try something like:

use strict;
use warnings;
use Bio::SeqIO;

my %unique;    # sequences already written, stored as hash keys

my $file   = "myseqs.fa";
my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while (my $seqs = $seqio->next_seq) {
  my $seq = $seqs->seq;
  # write only the first record seen for each distinct sequence
  unless (exists $unique{$seq}) {
    $outseq->write_seq($seqs);
    $unique{$seq} = 1;
  }
}

This will write out a new FASTA file, myseqs.fa.uniq, with only unique sequences (but no record of the other IDs with that sequence).
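If a record of the other IDs is needed, one option is to group the IDs by sequence before writing anything out. The following is a minimal Python sketch of that idea and is not part of the BioPerl script above. The names `read_fasta` and `collapse`, the ';' separator between IDs, and the case-insensitive comparison are all illustrative assumptions.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse(records):
    """Map each distinct sequence to the list of IDs that carry it.

    Sequences are compared in upper case (an assumption); the ID is taken
    to be the first whitespace-delimited token of the header.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    for header, seq in records:
        tokens = header.split()
        seq_id = tokens[0] if tokens else header
        groups.setdefault(seq.upper(), []).append(seq_id)
    return groups


def write_collapsed(groups, out):
    """Write one record per distinct sequence, joining its IDs with ';'."""
    for seq, ids in groups.items():
        out.write(">%s\n%s\n" % (";".join(ids), seq))
```

A typical call would be `write_collapsed(collapse(read_fasta(open("myseqs.fa"))), open("myseqs.collapsed.fa", "w"))`, where both file names are placeholders. All sequences are held in memory, so this suits small to moderate files.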

Ram · Answer 4 · 2010-10-21

10

14.1 years ago

Darked89 4.7k

Depending on your application, you may also want to filter out entries that have the same sequence but are, e.g., just 1 bp shorter.

uclust --sort seqs.fasta --output seqs_sorted.fasta
uclust --input seqs_sorted.fasta --uc results.uc --id 1.00
uclust --uc2fasta results.uc --input seqs.fasta --output results.fasta --types S

This should give you results.fasta purged of sequences that are identical to another entry and equal in length or shorter.

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.1 years ago by Darked89 4.7k

0

good to know! uclust looks to be a useful tool.

ADD REPLY • link 14.0 years ago by brentp 24k

score 9 · Answer 5 · 2010-10-21

9

14.1 years ago

Haibao Tang 3.0k

Encode the sequences in hash using SHA-1 or MD5, and then check for collision. If you are familiar with Python, here is a useful recipe to start.
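The linked recipe is not reproduced here. Purely as an illustration of the hashing idea, a minimal sketch might look like the following. The names `iter_fasta` and `unique_by_digest` are hypothetical, and comparing upper-cased sequences is an assumption.

```python
import hashlib


def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_by_digest(records):
    """Yield the first (header, sequence) seen for each distinct SHA-1 digest."""
    seen = set()
    for header, seq in records:
        # Only the 20-byte digest is stored, not the sequence itself.
        digest = hashlib.sha1(seq.upper().encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield header, seq


def dedupe_file(in_path, out_path):
    """Stream a FASTA file and write only first occurrences of each sequence."""
    with open(in_path) as handle, open(out_path, "w") as out:
        for header, seq in unique_by_digest(iter_fasta(handle)):
            out.write(">%s\n%s\n" % (header, seq))
```

Because records are streamed and only one fixed-size digest is kept per unique sequence, memory use grows with the number of distinct sequences rather than with their total length. Two different sequences sharing a SHA-1 digest is theoretically possible but extremely unlikely in practice.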

ADD COMMENT • link 14.1 years ago by Haibao Tang 3.0k

0

Sorry, I was thinking of using the same approach to clean up a big sequence file containing ~14,000 influenza virus sequences. Won't that consume a lot of memory (RAM)? Is there another way?

ADD REPLY • link 11.5 years ago by Medhat 9.8k

Ram · Answer 6 · 2010-10-21

7

14.1 years ago

Will 4.6k

Here's a python script that should be able to do it:

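import sys
from itertools import groupby

if __name__ == '__main__':

    # input and output paths are taken from the command line
    inname, oname = sys.argv[1], sys.argv[2]

    ishead = lambda x: x.startswith('>')
    all_seqs = set()
    with open(inname) as handle:
        with open(oname, 'w') as outhandle:
            head = None
            for h, lines in groupby(handle, ishead):
                if h:
                    # keep the header without its trailing newline
                    head = next(lines).rstrip()
                else:
                    # join wrapped lines so that line width does not affect the comparison
                    seq = ''.join(line.strip() for line in lines)
                    if seq not in all_seqs:
                        all_seqs.add(seq)
                        outhandle.write('%s\n%s\n' % (head, seq))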

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.1 years ago by Will 4.6k

0

Does this script only collapse sequences that are completely overlapping or will it collapse sequences where there is only partial overlap but the overlapping region is redundant?

ADD REPLY • link 12.7 years ago by Rduncan ▴ 60

0

This one only checks for complete identity: the "if seq not in all_seqs" line only does a membership test for the whole sequence. However, with a little work you could probably modify it to look at sequence regions.
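As a rough sketch of what one such modification could look like, the following drops any sequence that is an exact substring of a longer (or identical) sequence that has already been kept. It is a simple quadratic-time illustration for small sets only. `remove_contained` is a hypothetical name, and partial overlaps and reverse complements are deliberately not handled.

```python
def remove_contained(records):
    """Drop exact duplicates and sequences fully contained in a kept sequence.

    records: iterable of (header, sequence) pairs.
    Returns the kept pairs, longest sequences first.
    """
    kept = []
    # sorted() is stable, so records of equal length keep their input order
    # and the first of several identical sequences is the one retained.
    for header, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in other for _, other in kept):
            kept.append((header, seq))
    return kept


if __name__ == '__main__':
    example = [('a', 'ACGTACGT'), ('b', 'GTAC'), ('c', 'ACGTACGT'), ('d', 'TTTT')]
    for header, seq in remove_contained(example):
        print('>%s\n%s' % (header, seq))
```

Note that the output is ordered by decreasing length rather than by input order. For large files, a dedicated tool is likely a better choice than this pairwise comparison.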

ADD REPLY • link 12.7 years ago by Will 4.6k

0

Wow, very clever. I don't fully understand a lot of the tool packages in Python, but this is very intuitive. I modified the script slightly to strip the extra carriage returns in the outhandle.write command. This removes whitespace that is aesthetically appealing but causes some programs to choke.

ADD REPLY • link updated 6.3 years ago by Ram 44k • written 10.1 years ago by joelrosenbaum • 0

Ram · Answer 7 · 2012-02-17

7

12.8 years ago

Manu Prestat 4.1k

GenomeTools works with big sets and with aa/prot sequence files (unlike fastx-toolkit), and is very fast and simple to use:

gt sequniq -o out.fasta in.fasta

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 12.8 years ago by Manu Prestat 4.1k

score 6 · Answer 8 · 2010-11-13

6

14.0 years ago

Khader Shameer 18k

There are already nice answers here.

I would recommend trying the CD-HIT versions designed for ESTs / nucleotides (CD-HIT-EST and CD-HIT-EST-2D) for this purpose.

ADD COMMENT • link 14.0 years ago by Khader Shameer 18k

score 6 · Answer 9 · 2010-11-13

6

14.0 years ago

Dave Lunt ★ 2.0k

uclust is about the fastest and one of the most flexible. If your sequences are not exactly the same length, or you also want to cluster sequences that are almost, but not exactly, the same, then look carefully at the flexibility and options of the tool you choose.

ADD COMMENT • link 14.0 years ago by Dave Lunt ★ 2.0k

Ram · Answer 10 · 2010-11-13

6

14.0 years ago

Rm 8.3k

for smaller sets try this:

perl -ne 'BEGIN{$/=">";$"=";"}($d,$_)=/(.*?)\n(.+?)>?$/s;push @{$h{lc()}},$d if$_;END{for(keys%h){print">@{$h{$_}}$_"}}' multi.seq.fasta

ADD COMMENT • link updated 6.3 years ago by Ram 44k • written 14.0 years ago by Rm 8.3k

1

Wonderful solution!!! Thank you for sharing it.

ADD REPLY • link 12.2 years ago by deepthithomaskannan ▴ 390

0

and so readable too. ;) now which has "... all the visual appeal of oatmeal with fingernail clippings mixed in." ?? http://en.wikiquote.org/wiki/Larry_Wall

ADD REPLY • link 14.0 years ago by brentp 24k

Ram · Answer 11 · 2010-10-21

5

14.1 years ago

Hanif Khalak ★ 1.3k

According to SeqAnswers: WU-blast, EBI-Exonerate, and bioperl all have stand-alone programs to make an "nrdb" = non-redundant database of sequences.

Note: WU-blast is now being distributed commercially as AB-Blast - get a free personal license. Older archived versions can be found here.

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 14.1 years ago by Hanif Khalak ★ 1.3k

0

Strictly the nrdb program generates a non-identical database. The source for nrdb can be found in http://blast.advbiocomp.com/pub/nrdb/, no license required. This was a newer version than that bundled in WU-BLAST, no idea if this is still the case for AB-BLAST. Originally nrdb was used to produce NCBI's 'nr' and 'nt' databases. Of course 'nt' is no longer non-identical and 'nr' is produced using a slightly different process today, but nrdb is still a popular way to generate non-identical databases for use with BLAST.

ADD REPLY • link 12.8 years ago by Hamish ★ 3.3k

score 4 · Answer 12 · 2016-02-17

4

8.8 years ago

BioApps ▴ 800

On Windows you can use my Kitten Sequence Dereplicator (which by the way, was updated recently).

The program is based on CD-Hit which is pretty accurate and fast.

ADD COMMENT • link 8.8 years ago by BioApps ▴ 800

Ram · Answer 13 · 2015-03-17

3

9.7 years ago

Brian Bushnell 20k

I wrote a program for this problem, called Dedupe. It's very fast, and also (optionally) handles contained sequences. Usage:

dedupe.sh in=file.fasta out=nodupes.fasta

If you don't want fully-contained substrings removed, then add the flag ac=f (short for absorbcontainments=false).

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.7 years ago by Brian Bushnell 20k

0

Really helpful, thank you Brian Bushnell. It gave me the unique sequences in my set.

ADD REPLY • link 8.5 years ago by nhaituan ▴ 10

score 2 · Answer 14 · 2010-11-13

2

14.0 years ago

User 0726 ▴ 20

I have to say that removing duplicate sequences is not at all trivial, especially when you consider real data with sequencing/alignment errors. Another problem is sequences that are identical but do not fully overlap. I can say with certainty that CD-HIT-EST will filter some duplicates, but far from all.

I am working on a pipeline that uses BLAST, filtering out those contigs that have significant similarity to more than just themselves.

ADD COMMENT • link 14.0 years ago by User 0726 ▴ 20

0

Why doesn't CD-HIT-EST filter all duplicates? Shouldn't it be possible to set this in the parameters?

ADD REPLY • link 12.1 years ago by Yannick Wurm ★ 2.5k

Ram · Answer 15 · 2015-03-17

2

9.7 years ago

pyjiang2 ▴ 40

I tried fastx_collapser first, but it gives an error for multiply aligned FASTA sequences.

I found this useful website, which returns the unique FASTA sequences and also concatenates the header names of identical sequences: http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html_ncbi/html/fasta/uniqueseq.cgi

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by pyjiang2 ▴ 40

0

Great find! Only tool that I could find that works with amino acids.

ADD REPLY • link 7.7 years ago by tantrev ▴ 40

score 2 · Answer 16 · 2018-09-21

in R:

library(seqinr)

# Case 1: sequence names come from the list names (no attributes are set)
fasnoa <- seqinr::read.fasta("fasta.fasta", set.attributes = FALSE)
# get names
seqnames <- names(fasnoa)
# flag duplicates (a logical vector, so this is also safe when there are none)
dups <- duplicated(fasnoa)
# eliminate duplicates
namesnodups <- seqnames[!dups]
fasnoanodup <- fasnoa[!dups]
# write file
seqinr::write.fasta(fasnoanodup, namesnodups, "seqinrnoafasta.fas", nbchar = 10000)


# Case 2: sequence names come from the "Annot" attribute (the full header line)
fas <- seqinr::read.fasta("fasta.fasta")
# read the "Annot" attribute of each sequence
namesfas <- sapply(fas, function(x) attr(x, "Annot"))
# delete attributes so that only the sequences themselves are compared
for (i in seq_along(fas)) {
  attributes(fas[[i]]) <- NULL
}
# flag duplicates
dups <- duplicated(fas)
# create object without duplicates
fasnodup <- fas[!dups]
# create object with names
namesnodups <- namesfas[!dups]
# modify names: drop the leading ">" and replace whitespace and commas with "_"
namesnodups <- gsub("\\s+|,", "_", sub(">", "", namesnodups))
# write file
seqinr::write.fasta(fasnodup, namesnodups, "seqinrfasta.fas", nbchar = 10000)

Ram · Answer 17 · 2014-05-07

1

10.6 years ago

Shaun Jackman ▴ 420

seqmagick convert --deduplicate-sequences will remove duplicate sequences. See here:

http://fhcrc.github.io/seqmagick/convert_mogrify.html#examples

ADD COMMENT • link 10.6 years ago by Shaun Jackman ▴ 420

0

Hello Shaun:

I tried seqmagick as you suggested, but it did not give what I expected.

>seq1
ATCGATCGATATATATATAT
>seq2 part of seq1
CGATCGATATATATATA
>seq3 part of seq2
ATCGATATATAT
>seq4 reverse complementary of seq 2
TATATATATATCGATCG
>seq5 new seq
ATCGATCGACGATCGAGCGCG
>seq6 another new
ATCGATCGCGCGCGCGCGCGCGC
>seq7 psubstring of seq6
CGCGCGCGCGCGCGCG

Here is the command I used:

$ seqmagick convert --deduplicate-sequences test.fasta test_seqmagick.fasta

$ cat test_seqmagick.fasta
>seq1
ATCGATCGATATATATATAT
>seq2 part of seq1
CGATCGATATATATATA
>seq3 part of seq2
ATCGATATATAT
>seq4 reverse complementary of seq 2
TATATATATATCGATCG
>seq5 new seq
ATCGATCGACGATCGAGCGCG
>seq6 another new
ATCGATCGCGCGCGCGCGCGCGC
>seq7 psubstring of seq6
CGCGCGCGCGCGCGCG

Did I miss anything?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by yifangt86 ▴ 60

0

yifangt, one of us is certainly confused. None of your input sequences are duplicated. Feed seqmagick a file with some duplicated sequences and it will work.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by Chris Warth ▴ 110

0

Looks like he also wanted to remove contained sequences and reverse complements. That can be done with Dedupe (see the dedupe post on this page).

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by Brian Bushnell 20k

0

Thanks Brian!

Yes, what I meant is to dedupe sequences with the original and reverse complemented sequences, and the containment of both strands. Not sure how necessary this job is for any application like assembly, but that's another question.
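For the exact-match part of this (the same sequence on either strand, without containment), one simple approach is to compare a strand-independent key for each sequence. The sketch below is an illustration under stated assumptions, not how Dedupe or seqmagick work internally. The function names are hypothetical, only A/C/G/T are complemented, and any other characters (such as N) are left unchanged.

```python
# Translation table for complementing A/C/G/T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Return a key that is identical for a sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def unique_either_strand(records):
    """Yield the first (header, sequence) seen for each strand-independent key."""
    seen = set()
    for header, seq in records:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            yield header, seq
```

With the test file above, `reverse_complement("CGATCGATATATATATA")` gives `"TATATATATATCGATCG"`, so seq4 would be treated as a duplicate of seq2, while the contained sequences seq3 and seq7 would still be kept.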

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by yifangt86 ▴ 60

0

It's not usually necessary unless you want to combine multiple assemblies. Sometimes it is also useful in RNA-seq transcriptome assembly.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Brian Bushnell 20k

Ram · Answer 18 · 2017-02-09

0

7.8 years ago

Eslam Samir ▴ 110

Here is my free program on Github: Sequence database curator

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It runs on these operating systems:

Windows
Mac
Linux

It also works for:

Fasta format
Fastq format

Best Regards

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 7.8 years ago by Eslam Samir ▴ 110

score 0 · Answer 19 · 2018-10-05

0

6.1 years ago

SilentGene ▴ 110

Try this Python script, which can remove duplicate sequences from one or several FASTA files. Use the --id or --seq option to indicate whether the filter should work on the ID or on the sequence itself.

ADD COMMENT • link 6.1 years ago by SilentGene ▴ 110

score 0 · Answer 20 · 2018-11-15

0

6.0 years ago

michau ▴ 60

In Jalview (the best alignment viewer), use the "Remove Redundancy" option, where you can select the threshold: Ctrl + D, or Edit > Remove Redundancy.

ADD COMMENT • link 6.0 years ago by michau ▴ 60

0

Is there a limit on the number of sequences this tool can handle? This might be an option for smaller files, but I doubt that a FASTA file of several GB can be handled that way.

ADD REPLY • link 6.0 years ago by ATpoint 85k