Hi. I want to remove any sequence that is an exact part (substring) of a longer sequence in a multifasta file. For example, say I have the three sequences below:
>seq1
ACGACGATCGT**ACTAGCATCGAGCGTAC**TACGTAGCGCGT
>seq2
**ACTAGCATCGAGCGTAC**
>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT
seq2 is exactly part of seq1. So after removing the partial (duplicate) sequences, I expect to have the following multifasta file:
>seq1
ACGACGATCGTACTAGCATCGAGCGTACTACGTAGCGCGT
>seq3
AGCAGCGTACGTGACTACGACGATCTACGTATCTAGCTCGTACACT
All the answers I managed to find deal with removing exact duplicates only. Is there any tool or script that does this? Thanks in advance.
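You can try CD-HIT with the sequence identity threshold set to 1 (option -c 1.0); for nucleotide sequences like yours, use the cd-hit-est program from the CD-HIT package. It clusters sequences that are identical over the full length of the shorter one and returns the longest sequence of each cluster, so a sequence that is fully contained in a longer one is dropped.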
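If you would rather not install anything, a minimal Python sketch of the same idea is below. It is only an illustration under some assumptions: the function names `read_fasta` and `remove_contained` are made up for this example, matching is case-insensitive, reverse complements are not checked, and the all-against-all comparison is quadratic, so it suits small to medium files rather than large datasets.

```python
def read_fasta(lines):
    """Parse FASTA text (an iterable of lines) into a list of (header, sequence)."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_contained(records):
    """Drop every sequence that is a substring of another kept sequence.

    Sequences are visited from longest to shortest, so each one only needs
    to be compared against the (longer or equal-length) sequences already kept.
    Of several identical sequences, only the first one is kept.
    """
    order = sorted(range(len(records)), key=lambda i: -len(records[i][1]))
    kept = []
    for i in order:
        seq = records[i][1].upper()
        if not any(seq in records[j][1].upper() for j in kept):
            kept.append(i)
    # Restore the original file order for the surviving records.
    return [records[i] for i in sorted(kept)]
```

To use it, open your file (for example a hypothetical `input.fasta`), pass the file object to `read_fasta`, run `remove_contained` on the result, and write each surviving record back out as `>header` followed by the sequence. On the three sequences in the question this keeps seq1 and seq3.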