How to make sure there is no duplicate sequence in a fasta file?
9.2 years ago
seta ★ 1.9k

Hi all,

I was wondering if there is a command to check that a FASTA file contains no duplicate sequences. Please share any commands you find helpful.

Thanks

fasta sequence • 10k views

Already asked and answered here and here and here and probably a few others.


LMGTFY (let me Google that for you), here.


Thanks for the help. I know how to remove duplicate sequences, but before doing that I just want to check whether there are any duplicates at all.


If there are no duplicate sequences and you run a duplicate remover, the resulting file should be identical to the input, so why worry about it?


Just to save time: I'm working on an ordinary laptop and dealing with a large FASTA file.


It's far more productive to just use a duplicate remover than to write a duplicate detector, so I doubt anyone has one. Maybe samtools faidx can help.
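
For what it's worth, a detector that only answers yes or no can be quite short. Below is a minimal sketch, not a tested tool: `fasta_has_dups` is a hypothetical helper name, it treats upper- and lower-case bases as equal (an assumption), and it stops reading as soon as it meets the first repeated sequence.

```shell
# Hypothetical helper: exit status 0 if some sequence occurs twice, 1 otherwise.
fasta_has_dups() {
    awk '
        # Returns 1 if the current sequence was seen before, else remembers it.
        function repeated() {
            if (seq in seen) {
                print "duplicate sequence: " id " matches " seen[seq]
                found = 1
                return 1
            }
            seen[seq] = id
            return 0
        }
        # Header line: check the record just finished, then start a new one.
        /^>/ { if (have && repeated()) exit; have = 1; id = $0; seq = ""; next }
        # Sequence line: drop a trailing CR and join wrapped lines, upper-cased.
        { sub(/\r$/, ""); seq = seq toupper($0) }
        # Check the last record (unless we already stopped) and set the status.
        END { if (!found && have) repeated(); exit(found ? 0 : 1) }
    ' "$1"
}
```

Something like `fasta_has_dups tmp.fa && echo "duplicates present"` would then do the check. It still keeps every sequence read so far in memory, so on a very large file it saves time only when a duplicate turns up early.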


Well, Dedupe will simply detect duplicate sequences and not remove them if you don't specify an output file :)


Hello, you can use the EMBOSS application skipredundant. Good luck!

9.2 years ago
Alternative ▴ 290
  1. If your duplicated sequences have the same ID, the following will give the count per record:

    grep ">" tmp.fa | sort | uniq -c
    
  2. To list only the IDs that occur more than once (again assuming duplicate records have identical IDs):

    grep ">" tmp.fa | sort | uniq -d
    
  3. If you are not sure that duplicated sequences also have duplicated IDs and want to be on the safe side, you can check the sequences themselves with the following awk statement (adjust the output as you like, e.g. by printing only the counts, or only the records with a count > 1):

    awk 'BEGIN{ORS="\n";FS="\n";RS=">"}NR>1{REC[substr($0,index($0,$2))]++} END {for(i in REC){print REC[i],i}}' tmp.fa
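
One caveat with the awk one-liner in item 3: it compares the sequence text including its line breaks, so two copies of the same sequence wrapped at different widths are counted as different. A minimal sketch that joins wrapped lines before comparing is below. `fasta_dup_seqs` is a hypothetical helper name, and treating upper- and lower-case bases as equal is an assumption that may not suit soft-masked data.

```shell
# Hypothetical helper: list sequences that occur more than once,
# ignoring line wrapping and letter case.
fasta_dup_seqs() {
    awk '
        # Store the record just finished: bump its count and remember its ID.
        function record() {
            count[seq]++
            ids[seq] = (count[seq] > 1) ? ids[seq] "," id : id
        }
        # Header line: close the previous record; ID is the text after ">"
        # up to the first blank.
        /^>/ { if (have) record(); have = 1; id = substr($1, 2); seq = ""; next }
        # Sequence line: strip blanks and CRs, join wrapped lines, upper-cased.
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END {
            if (have) record()
            # One line per repeated sequence: count, then the IDs sharing it.
            for (s in count) if (count[s] > 1) print count[s] "\t" ids[s]
        }
    ' "$1"
}
```

Running `fasta_dup_seqs tmp.fa` prints the count and the comma-separated IDs for each repeated sequence, and prints nothing if there are no duplicates. It holds every distinct sequence in memory, which may matter for a large file on a small machine.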
    

Hope this will help,

P.


grep ">" tmp.fa | sort | uniq -c


Thanks, Afagh, for the correction. Indeed, sort is mandatory in that case, because uniq only compares adjacent lines. Corrected.
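
A tiny illustration of why the sort matters, using three made-up header lines rather than a real file (`headers` is just a helper for this example):

```shell
# Three header lines; ">a" is repeated, but the two copies are not adjacent.
headers() { printf '>a\n>b\n>a\n'; }

headers | uniq -d          # prints nothing: uniq only compares neighbouring lines
headers | sort | uniq -d   # prints ">a"
```

Without the sort, a duplicated ID is reported only if the two records happen to sit next to each other in the file.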

5.0 years ago

Hello, you can use the EMBOSS application skipredundant. Good luck!
