Hi all,
I was wondering if there is a command to make sure there are no duplicate sequences in a FASTA file. Please share any helpful commands.
Thanks
If your duplicated sequences have the same ID, the following will give the count per record:
grep ">" tmp.fa | sort | uniq -c
To list only the IDs that occur more than once (assuming duplicate records have identical IDs):
grep ">" tmp.fa | sort | uniq -d
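Note that the two commands above compare whole header lines, so records that share an ID but carry different descriptions would not be flagged. A minimal sketch that compares only the ID, assuming the ID is the header text up to the first space or tab (the function name `dup_ids` is made up for illustration):

```shell
# dup_ids (hypothetical helper): print each sequence ID that occurs
# more than once in a FASTA file.
# Assumption: the ID is the header text between ">" and the first whitespace.
dup_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}
```

Run it as `dup_ids tmp.fa`. No output means no ID is repeated.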
Now, if you want to check the sequences themselves, to be on the safe side in case duplicated sequences do not share an ID, you can use the following awk statement (adjust the output as you like, e.g. by printing only the counts, or only the records with a count > 1):
awk 'BEGIN{ORS="\n";FS="\n";RS=">"}NR>1{REC[substr($0,index($0,$2))]++} END {for(i in REC){print REC[i],i}}' tmp.fa
Hope this will help,
P.
Hello, you can use the EMBOSS application skipredundant. Good luck!
Already asked and answered here and here and here and probably a few others.
lmgtfy (let me google that for you), here.
Thanks for the great help. I know how to remove duplicate sequences, but before doing that I just wanted to check whether there are any duplicates.
If there are no duplicate sequences and you use a duplicate remover, the resulting file should be the same, so why worry about it?
Just to save time, because I'm working on an ordinary laptop and dealing with a large FASTA file.
It's far more productive to just use a dup remover than to write a dup detector, so I doubt anyone has one. Maybe samtools faidx can help.
Well, Dedupe will simply detect duplicate sequences and not remove them if you don't specify an output file :)