Remove duplicates in fasta file based on ID
8
12
Entering edit mode
9.5 years ago

I've found several threads on this (rather simple) topic but none quite simple enough, which is to remove entries in a fasta file based on their one liner >name, which in my case is numeric (gi).

Based on Pierre Lindenbaum's posting on other comments, you would linearise the sequences and then sort by column 1 (as opposed to column 2 if you wanted to sort by sequence). And then you'd employ sort unique and sed?

>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

There are no spaces between characters or lines in my file.

fasta sort sed • 28k views
ADD COMMENT
2
Entering edit mode

linerarize, sort using options -k1,1 -u, move back to fasta using tr

ADD REPLY
0
Entering edit mode

Is this correct?

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t ' ' -f -k1,1 |\
sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa
ADD REPLY
0
Entering edit mode

Update, the above (from Pierre Lindenbaum) does the job. Very good.

The only thing, I dropped the -t ' ' and -f flags in sort (didn't seem necessary?). And the first line in the output file gives a single line of >, which I manually deleted.

ADD REPLY
24
Entering edit mode
9.5 years ago
lh3 33k
awk '/^>/{f=!d[$1];d[$1]=1}f' in.fa > out.fa
ADD COMMENT
1
Entering edit mode

Explanation:

  • /^>/: when matches line start with > do this block {f=!d[$1];d[$1]=1}.
  • f=!d[$1]: f is false only if sequence name does exists in d[$1], and keeps being false until new sequence.
  • d[$1]=1: register sequence name;
  • f is true, print line.
ADD REPLY
0
Entering edit mode

This solution is perfect, I would very appreciate a simple explanation for it.

ADD REPLY
5
Entering edit mode
9.5 years ago
iraun 6.2k

Try this command:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa | awk '!seen[$1]++'

In the first part the fasta is converted to tabular format and in the second part the duplicated ID's are removed.

You'll need to transform to fasta format again, but it's not complicated ;)

ADD COMMENT
0
Entering edit mode

airan thanks that seems to work on a test file - any chance you could guide me doing the conversion back to fasta? I'm rather inexperienced here. Thanks

ADD REPLY
2
Entering edit mode

Extending airan solution for converting it to a fasta format:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$1]++' | awk -v OFS="\n" '{print $1,$2}'
ADD REPLY
0
Entering edit mode

And in case your headers have spaces and you also want to consider the sequence

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$0]++' | awk -v OFS="\n" '{for(i=2;i<NF;i++) head = head " " $i; print $1 " " head,$NF; head = ""}'
ADD REPLY
3
Entering edit mode
9.5 years ago

Thanks for your comments. I went with the orignal suggestion of Pier's:

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -k1,1 | sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

The first line of the output contains a '>' only, which I deleted manually.

ADD COMMENT
0
Entering edit mode

Please see: Best bioinfo one-liners? for a better way to linearize fasta. At the end you can convert the lines to fasta using tr "\t" "\n"

ADD REPLY
2
Entering edit mode
9.5 years ago
Varun Gupta ★ 1.3k

This would work assuming you have sequence in one line and it is not split over multiple lines.

#!/usr/bin/perl-w
use strict;
use warnings;

my %id2seq=();
my $key = '';
while(<>){
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
    }else{
        $id2seq{$key} = $_;
    }
}

foreach(keys %id2seq){
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD COMMENT
0
Entering edit mode

For multi-line sequences, I think you can use $id2seq{$key} .= $_

ADD REPLY
0
Entering edit mode

@ biolab

It doesn't work if 2 same id's are adjacent some thing like this

>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
CCGG
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
CCGG

IT will append it for the same header

ADD REPLY
0
Entering edit mode

Hi, Varun Gupta, your point is very good. I further modified the script to allow multi-line sequences. If a header is duplicated, only the 1st is outputted.

#!/usr/bin/perl
use strict;
use warnings;

my (%id2seq, %seen);
my ($key, $duplicate);
while(<>) {
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
        if (exists $seen{$key}) {
            print STDERR "Attention: header $key duplicated.\n";
            $duplicate  = 1;
        } else {
            $seen{$key} = 1;
            $duplicate  = 0;
        }
    } else {
        ($duplicate == 1) ? (next) : ($id2seq{$key} .= $_);
    }
}

foreach(keys %id2seq) {
    print join("\n",">".$_,$id2seq{$_}),"\n";
}
ADD REPLY
2
Entering edit mode
9.5 years ago
tomc ▴ 90

Assuming dup.fasta has one >defline followed by one line of sequence

awk '/^>/{id=$0;getline;arr[id]=$0}END{for(id in arr)printf("%s\n%s\n",id,arr[id])}' dup.fasta > uniq.fasta

There is no requirement duplicates be adjacent, there is no guarantee the output order is related to the input order.

ADD COMMENT
1
Entering edit mode
7.8 years ago
Eslam Samir ▴ 110

Here is my free program on Github Sequence database curator

It is a very fast program and it can deal with:

  1. Nucleotide sequences
  2. Protein sequences

It can work under Operating systems:

  1. Windows
  2. Mac
  3. Linux

It also works for:

  1. Fasta format
  2. Fastq format

Best Regards

ADD COMMENT
0
Entering edit mode
9.5 years ago
anp375 ▴ 190

If you use perl, you could put the sequences in a hash with the name/number as the key and the sequence as the value. Then you could print it to a new file, or empty your file and print it back in. Every duplicate you come across will just replace the previous one.

ADD COMMENT
0
Entering edit mode

Could you give an example @anp375?

ADD REPLY
0
Entering edit mode

sure, in a few hours though

ADD REPLY
1
Entering edit mode
use Bio::SeqIO;
use strict;

my $protein_fasta = "/Somefilepath/protein.txt";
my $protein_out = ">/Somefilepath/reduced.txt";

my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format =>'Fasta');# This is a constructor for SeqIO

# The filename specifies whether it is read or write
# The big arrows are called fat commas. They are just commas used for documenting. -format => 'Fasta' is
# just -format, 'Fasta' but the arrow shows the relationship for readability
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'Fasta');# SeqIO out because of >


# Those constructors set up the filehandles. The method to get the sequences inside is called 'next_seq'
# It returns a generically formatted sequence rather than Fasta

my %seqs; # This is a hash. I'm assuming the name you are using is an accession number, so I'll make the keys accession numbers. Only one of each will be left.
while(my$seq = $seq_in -> next_seq){
    $seqs{$seq->accession_number} = $seq;
}

foreach(values %seqs){
    $seq_out->write_seq($_);# write_seq converts the generic sequence to a Fasta sequence and writes
                            # it to the file

}

I hope that works. If the number is not the accesion number, there is a list of other ids here: http://search.cpan.org/dist/BioPerl/Bio/Seq.pm#accession_number

ADD REPLY
0
Entering edit mode
6.0 years ago
michau ▴ 60

Copied from my other post: How to remove duplicate sequences in fasta file using python?

Learn to use Biopython library. It's handy as hell. You can use any format as in/out

from Bio import SeqIO

with open('output.fasta', 'a') as outFile:
    record_ids = list()\
    for record in SeqIO.parse('input.fasta', 'fasta'):
        if record.id not in record_ids:
            record_ids.append( record.id )
            SeqIO.write(record, outFile, 'fasta')
ADD COMMENT

Login before adding your answer.

Traffic: 3030 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6