I would like to dereplicate a 3 GB FASTA file of amino acid sequences, including the removal of shorter sequences that occur within longer ones (substring dereplication). The goal is twofold: to build a smaller database against which to search peptide mass spectra, and to identify abundant sequences in the file.
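To make the goal concrete, here is a minimal Python sketch of what I mean by substring dereplication. It is only an illustration on toy data: the function name `substring_dereplicate` and the sequences are made up, and the quadratic comparison would be far too slow for a 3 GB file.

```python
def substring_dereplicate(records):
    """Naive substring dereplication.

    Drops every sequence that is identical to, or contained in,
    a sequence that has already been kept.
    """
    kept = []
    # Process longest sequences first, so a sequence can only be
    # contained in one that was examined earlier. Ties are broken by name.
    ordered = sorted(records.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        if any(seq in other for _, other in kept):
            continue  # exact duplicate or substring of a kept sequence
        kept.append((name, seq))
    return dict(kept)


# Toy input (hypothetical sequences, for illustration only).
toy = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQ",
    "p2": "AYIAKQRQI",               # substring of p1
    "p3": "MKTAYIAKQRQISFVKSHFSRQ",  # exact duplicate of p1
    "p4": "GGSLVPRGS",               # unrelated, should survive
}

deduplicated = substring_dereplicate(toy)
print(sorted(deduplicated))
```

Only `p1` and `p4` should remain; `p2` and `p3` are absorbed by `p1`.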
So far, I have explored prefix dereplication (slightly different from what I ideally want) in vsearch and substring dereplication in usearch, but neither is satisfactory. Prefix dereplication in vsearch does not support protein sequences. Substring dereplication in usearch requires v5.2, and the freely available build of usearch 5.2 has a 2 GB memory limit, which is not enough for my file.
Does anyone know of a tool that will suit my needs? Thanks in advance.
I am not sure whether substring dereplication is part of it, but you could take a look at CD-HIT for this purpose.
cd-hit-dup fails with the message
This may be because some of my sequences are as short as 9 residues. Fundamentally, this command is geared toward longer nucleotide sequences. It also does not do any form of substring dereplication.
The PIR peptide search could be helpful, even if it does not answer your question.
User genomax's comment led me to exactly the solution I wanted. Instead of cd-hit-dup, use cd-hit, another tool in the CD-HIT package. It clusters sequences, including subsequences, and one can specify a sequence identity threshold of 100%. The following command writes two files: a dereplicated FASTA file and a file identifying the sequences in each cluster.
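Since part of the goal was to identify abundant sequences, the cluster file can be summarised by cluster size. Below is a minimal Python sketch, not part of CD-HIT itself. It assumes the cluster file uses the usual `.clstr` layout: a `>Cluster N` header followed by one member per line, with the representative marked by `*`. The helper names `read_clusters` and `most_abundant` and the example lines are hypothetical.

```python
import re
from collections import defaultdict

# Assumed member-line layout (tab after the index):
#   0	22aa, >p1... *
#   1	9aa, >p2... at 100.00%
MEMBER = re.compile(r"^\d+\s+(\d+)aa, >(.+?)\.\.\. (\*|at .+)$")


def read_clusters(lines):
    """Return (members per cluster, representative per cluster)."""
    clusters = defaultdict(list)
    representative = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            continue
        match = MEMBER.match(line)
        if match is None or current is None:
            continue  # skip anything that does not fit the assumed layout
        _length, name, status = match.groups()
        clusters[current].append(name)
        if status == "*":
            representative[current] = name
    return clusters, representative


def most_abundant(clusters, representative, top=10):
    """Rank clusters by member count, largest first."""
    ranked = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    return [(representative.get(c), len(clusters[c])) for c in ranked[:top]]


# Made-up example content, standing in for a real cluster file.
example = """>Cluster 0
0\t22aa, >p1... *
1\t9aa, >p2... at 100.00%
2\t22aa, >p3... at 100.00%
>Cluster 1
0\t9aa, >p4... *
""".splitlines()

clusters, reps = read_clusters(example)
ranking = most_abundant(clusters, reps)
print(ranking)
```

For a real run, pass an open file handle to `read_clusters` instead of the example lines. Cluster size here counts input sequences collapsed into each representative, which is only one possible definition of abundance.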