Question

Remove duplicates in fasta file based on ID

12

9.5 years ago

oliver.bayfield ▴ 210

I've found several threads on this (rather simple) topic, but none quite simple enough. I want to remove duplicate entries from a fasta file based on their one-line >name header, which in my case is numeric (gi).

Based on Pierre Lindenbaum's posts in other threads, you would linearise the sequences and then sort by column 1 (as opposed to column 2 if you wanted to sort by sequence). And then you'd use sort with the unique option, and sed?

>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC

There are no spaces between characters or lines in my file.

fasta sort sed • 28k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by oliver.bayfield ▴ 210

2

Linearize, sort using the options -k1,1 -u, then convert back to fasta using tr.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Pierre Lindenbaum 164k

0

Is this correct?

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa |\
tr -d '\n' | tr "#" "\n" | tr "@" "\t" |\
sort -u -t ' ' -f -k1,1 |\
sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by oliver.bayfield ▴ 210

0

Update, the above (from Pierre Lindenbaum) does the job. Very good.

The only changes: I dropped the -t ' ' and -f flags from sort (they didn't seem necessary), and the first line of the output file is a lone >, which I deleted manually.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by oliver.bayfield ▴ 210

Ram · Answer 1 · 2015-05-22

24

9.5 years ago

lh3 33k

awk '/^>/{f=!d[$1];d[$1]=1}f' in.fa > out.fa

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by lh3 33k

1

Explanation:

/^>/: when a line starts with >, run the block {f=!d[$1];d[$1]=1}.
f=!d[$1]: f becomes false if this sequence name is already registered in d, and true otherwise; it keeps that value until the next header line.
d[$1]=1: register the sequence name.
f: the trailing pattern with no action; while f is true, the line is printed.
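
For readers more comfortable with Python, the same flag-based logic can be sketched as below. This is only an illustration of the explanation above, not part of lh3's answer; the function name dedupe_by_first_field is made up here, and like the awk version it keys on the first whitespace-delimited field of the header line.

```python
def dedupe_by_first_field(lines):
    """Yield FASTA lines, skipping records whose header ID was already seen.

    Mirrors the awk one-liner: `seen` plays the role of d[], `keep` of f.
    """
    seen = set()
    keep = False  # awk's f starts out false, so lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            key = line.split()[0]   # awk's $1, e.g. ">123456"
            keep = key not in seen  # f = !d[$1]
            seen.add(key)           # d[$1] = 1
        if keep:                    # the bare pattern f: print while true
            yield line


if __name__ == "__main__":
    demo = [">123456\n", "AAAG\n", ">123456\n", "AAAG\n", ">7\n", "CC\n"]
    print("".join(dedupe_by_first_field(demo)), end="")
```

Because it is a generator, it can be fed an open file object directly and never holds more than the set of IDs in memory.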

ADD REPLY • link updated 22 months ago by Ram 44k • written 4.1 years ago by biocyberman ▴ 870

0

This solution is perfect. I would very much appreciate a simple explanation of it.

ADD REPLY • link updated 22 months ago by Ram 44k • written 7.7 years ago by Caesar ▴ 10

Ram · Answer 2 · 2015-05-22

5

9.5 years ago

iraun 6.2k

Try this command:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file.fa | awk '!seen[$1]++'

In the first part the fasta is converted to a tabular format (one record per line), and in the second part records with a duplicated ID are removed.

You'll need to transform to fasta format again, but it's not complicated ;)
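
The whole round trip described above (fasta to one record per line, drop repeated IDs, back to fasta) can also be sketched in Python. This is an illustrative sketch rather than iraun's method; the helper names are hypothetical, the ID is assumed to be the header text up to the first whitespace, and, like the awk version with RS=">", it assumes no ">" characters occur inside headers.

```python
def linearize(fasta_text):
    """Turn FASTA text into (header, sequence) pairs, joining wrapped sequence lines."""
    records = []
    for chunk in fasta_text.split(">")[1:]:  # [1:] skips whatever precedes the first '>'
        header, _, body = chunk.partition("\n")
        records.append((header.strip(), body.replace("\n", "")))
    return records


def unique_by_id(records):
    """Keep the first record for each ID (header text up to the first whitespace)."""
    seen = set()
    kept = []
    for header, seq in records:
        parts = header.split()
        rec_id = parts[0] if parts else ""
        if rec_id not in seen:
            seen.add(rec_id)
            kept.append((header, seq))
    return kept


def to_fasta(records):
    """Write the pairs back out as two-line FASTA records."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


if __name__ == "__main__":
    text = ">123456\nAAAG\nTGTG\n>123456\nAAAG\nTGTG\n"
    print(to_fasta(unique_by_id(linearize(text))), end="")
```

Note that the output has each sequence on a single line; re-wrapping to a fixed width would need an extra step.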

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by iraun 6.2k

0

airan thanks that seems to work on a test file - any chance you could guide me doing the conversion back to fasta? I'm rather inexperienced here. Thanks

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by oliver.bayfield ▴ 210

2

Extending airan's solution to convert the output back to fasta format:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$1]++' | awk -v OFS="\n" '{print $1,$2}'

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Varun Gupta ★ 1.3k

0

And in case your headers contain spaces and you also want to take the sequence into account when detecting duplicates:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' biostarhelp.txt | awk '!seen[$0]++' | awk -v OFS="\n" '{for(i=2;i<NF;i++) head = head " " $i; print $1 head,$NF; head = ""}'

ADD REPLY • link 6.3 years ago by marcosmorgan ▴ 120

Ram · Answer 3 · 2015-05-22

3

9.5 years ago

oliver.bayfield ▴ 210

Thanks for your comments. I went with Pierre's original suggestion:

sed -e '/^>/s/$/@/' -e 's/^>/#/' filein.fa | tr -d '\n' | tr "#" "\n" | tr "@" "\t" | sort -u -k1,1 | sed -e 's/^/>/' -e 's/\t/\n/' > fileout.fa

The first line of the output contains a '>' only, which I deleted manually.
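
A likely explanation for the stray '>' line, sketched in Python: the '#' that marks the first header becomes a newline at the very start of the stream, so the linearised text begins with an empty record, which sorts first and then gets '>' prepended. The function below imitates the pipeline on a string and can filter that empty record out. The function name is made up for illustration, and the sketch simplifies sort's behaviour by assuming headers contain no spaces and by using plain string ordering.

```python
def dedupe_like_pipeline(fasta_text, drop_empty=True):
    """Imitate the sed | tr | sort -u -k1,1 | sed pipeline on a string."""
    lines = fasta_text.splitlines()
    # sed: put '@' at the end of each header and replace its leading '>' with '#'
    marked = ["#" + ln[1:] + "@" if ln.startswith(">") else ln for ln in lines]
    # tr -d '\n', then '@' -> tab and '#' -> record separator
    records = "".join(marked).replace("@", "\t").split("#")
    if drop_empty:
        records = [r for r in records if r]  # the leading '' is the stray '>' line
    # sort -u -k1,1: one record per key, emitted in sorted key order
    first = {}
    for rec in records:
        first.setdefault(rec.split("\t", 1)[0], rec)
    ordered = [first[key] for key in sorted(first)]
    # final sed: restore '>' and turn the first tab back into a newline
    return "".join(">" + rec.replace("\t", "\n", 1) + "\n" for rec in ordered)


if __name__ == "__main__":
    text = ">123456\nAAAG\n>123456\nAAAG\n"
    print(repr(dedupe_like_pipeline(text, drop_empty=False)))
    print(repr(dedupe_like_pipeline(text)))
```

In the shell pipeline itself, the equivalent would presumably be to drop empty lines before the final sed, for example with grep -v '^$'.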

ADD COMMENT • link 9.5 years ago by oliver.bayfield ▴ 210

0

Please see: Best bioinfo one-liners? for a better way to linearize fasta. At the end you can convert the lines to fasta using tr "\t" "\n"

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Pierre Lindenbaum 164k

Ram · Answer 4 · 2015-05-22

2

9.5 years ago

Varun Gupta ★ 1.3k

This will work, assuming each sequence is on a single line and not split over multiple lines.

#!/usr/bin/perl -w
use strict;
use warnings;

my %id2seq=();
my $key = '';
while(<>){
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
    }else{
        $id2seq{$key} = $_;
    }
}

foreach(keys %id2seq){
    print join("\n",">".$_,$id2seq{$_}),"\n";
}

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by Varun Gupta ★ 1.3k

0

For multi-line sequences, I think you can use $id2seq{$key} .= $_

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by biolab ★ 1.4k

0

@ biolab

It doesn't work if the same ID appears twice, for example in two adjacent records like this:

>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
CCGG
>123456
AAAGTGTGTAGGAAGATGTGATGCCTCGAGATGC
CCGG

It will append the second sequence to the first, because both have the same header.

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by Varun Gupta ★ 1.3k

0

Hi Varun Gupta, that is a very good point. I modified the script further to allow multi-line sequences. If a header is duplicated, only the first record is output.

#!/usr/bin/perl
use strict;
use warnings;

my (%id2seq, %seen);
my ($key, $duplicate);
while(<>) {
    chomp;
    if($_ =~ /^>(.+)/){
        $key = $1;
        if (exists $seen{$key}) {
            print STDERR "Attention: header $key duplicated.\n";
            $duplicate  = 1;
        } else {
            $seen{$key} = 1;
            $duplicate  = 0;
        }
    } else {
        ($duplicate == 1) ? (next) : ($id2seq{$key} .= $_);
    }
}

foreach(keys %id2seq) {
    print join("\n",">".$_,$id2seq{$_}),"\n";
}

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by biolab ★ 1.4k

Ram · Answer 5 · 2015-05-22

2

9.5 years ago

tomc ▴ 90

Assuming dup.fasta has one >defline followed by one line of sequence

awk '/^>/{id=$0;getline;arr[id]=$0}END{for(id in arr)printf("%s\n%s\n",id,arr[id])}' dup.fasta > uniq.fasta

Duplicates are not required to be adjacent, but there is no guarantee that the output order is related to the input order. For a duplicated defline, the last sequence seen is the one that is kept.

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by tomc ▴ 90

Ram · Answer 6 · 2017-02-18

1

7.8 years ago

Eslam Samir ▴ 110

Here is my free program on GitHub: Sequence database curator

It is a very fast program and it can deal with:

Nucleotide sequences
Protein sequences

It runs on the following operating systems:

Windows
Mac
Linux

It also works for:

Fasta format
Fastq format

Best Regards

ADD COMMENT • link updated 22 months ago by Ram 44k • written 7.8 years ago by Eslam Samir ▴ 110

Ram · Answer 7 · 2015-05-22

0

9.5 years ago

anp375 ▴ 190

If you use perl, you could put the sequences in a hash with the name/number as the key and the sequence as the value. Then you could print it to a new file, or empty your file and print it back in. Every duplicate you come across will just replace the previous one.
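
A minimal sketch of that idea, written in Python rather than Perl (a dict stands in for the hash). The function name and the assumption that the ID is the first word of the header are illustrative choices, not part of the suggestion above. As described, a later duplicate overwrites the earlier one; note that a Python dict keeps a key at the position where it was first inserted.

```python
def dedupe_last_wins(lines):
    """Map each ID to its record; a repeated ID overwrites the earlier record."""
    records = {}    # id -> [header line, list of sequence chunks]
    current = None  # ID of the record currently being read
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = [line, []]  # replaces any earlier entry for this ID
        elif current is not None and line:
            records[current][1].append(line)
    return "".join(f"{header}\n{''.join(chunks)}\n" for header, chunks in records.values())


if __name__ == "__main__":
    demo = [">1 first\n", "AAAA\n", ">2\n", "CC\n", "GG\n", ">1 second\n", "TTTT\n"]
    print(dedupe_last_wins(demo), end="")
```

Everything is held in memory until the end, which is fine for small files but may matter for very large ones.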

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.5 years ago by anp375 ▴ 190

0

Could you give an example @anp375?

ADD REPLY • link 9.5 years ago by oliver.bayfield ▴ 210

0

sure, in a few hours though

ADD REPLY • link 9.5 years ago by anp375 ▴ 190

1

use Bio::SeqIO;
use strict;

my $protein_fasta = "/Somefilepath/protein.txt";
my $protein_out = ">/Somefilepath/reduced.txt";

my $seq_in = Bio::SeqIO->new(-file => "$protein_fasta", -format =>'Fasta');# This is a constructor for SeqIO

# The filename specifies whether it is read or write
# The big arrows are called fat commas. They are just commas used for documenting. -format => 'Fasta' is
# just -format, 'Fasta' but the arrow shows the relationship for readability
my $seq_out = Bio::SeqIO->new(-file => "$protein_out", -format => 'Fasta');# SeqIO out because of >


# Those constructors set up the filehandles. The method to get the sequences inside is called 'next_seq'
# It returns a generically formatted sequence rather than Fasta

my %seqs; # This is a hash keyed by sequence ID. For Fasta input, display_id is the text after '>' up to the first space. Only one record per ID will be left.
while(my $seq = $seq_in->next_seq){
    $seqs{$seq->display_id} = $seq;
}

foreach(values %seqs){
    $seq_out->write_seq($_);# write_seq converts the generic sequence to a Fasta sequence and writes
                            # it to the file
}

I hope that works. Note that for Fasta input accession_number is normally not set (BioPerl returns 'unknown' for it), so display_id is used as the key here. If you need a different identifier, there is a list of other ids here: http://search.cpan.org/dist/BioPerl/Bio/Seq.pm#accession_number

ADD REPLY • link updated 22 months ago by Ram 44k • written 9.5 years ago by anp375 ▴ 190

Ram · Answer 8 · 2018-11-15

Copied from my other post: How to remove duplicate sequences in fasta file using python?

Learn to use the Biopython library. It's handy as hell. You can use any supported format for input and output.

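from Bio import SeqIO

# Open with 'w' so that re-running the script does not append to an old output file
with open('output.fasta', 'w') as outFile:
    record_ids = set()  # IDs already written; a set gives fast membership tests
    for record in SeqIO.parse('input.fasta', 'fasta'):
        if record.id not in record_ids:
            record_ids.add(record.id)
            SeqIO.write(record, outFile, 'fasta')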