Question

Substring dereplication of protein sequences

1

7.6 years ago

smiller ▴ 70

I would like to dereplicate a 3 GB FASTA file of amino acid sequences, including the removal of shorter sequences that occur within longer ones (substring dereplication). The goals are to build a smaller database against which I can search peptide mass spectra, and to identify abundant sequences in the file.

So far, I have explored prefix dereplication in vsearch (slightly different from what I ideally want) and substring dereplication in usearch, but neither is satisfactory. Prefix dereplication in vsearch does not support protein sequences. Substring dereplication in usearch requires v5.2, and the freely available build of usearch 5.2 has a memory limit of 2 GB, which is insufficient for this file.

Does anyone know of a tool that will suit my needs? Thanks in advance.

dereplication proteomics • 3.1k views

ADD COMMENT • link 7.6 years ago by smiller ▴ 70

0

I am not sure whether substring dereplication is part of it, but you could take a look at CD-HIT for this purpose.

ADD REPLY • link 7.6 years ago by GenoMax 152k

0

cd-hit-dup fails with the message:

cd-hit-dup: cdhit-dup.cxx:193: int HashingDepth(int, int): Assertion `len >= min' failed.

This may be because some of my sequences are as short as 9 residues. More fundamentally, cd-hit-dup is geared toward longer nucleotide sequences, and it does not do any form of substring dereplication.

ADD REPLY • link 7.6 years ago by smiller ▴ 70

0

PIR peptide search could be helpful, even if it does not directly answer your question.

ADD REPLY • link 7.6 years ago by me ▴ 760

0

GenoMax's comment led me to exactly the solution I wanted. Within the CD-HIT package, use the cd-hit program rather than cd-hit-dup. cd-hit clusters sequences, and a shorter sequence contained in a longer one ends up in the longer sequence's cluster. Setting the sequence identity threshold to 100% (-c 1) restricts the clusters to exact matches. The following command writes two files: a dereplicated FASTA file and a cluster file (<output fasta>.clstr) listing the sequences in each cluster.

./cd-hit -i <input fasta> -o <output fasta> -c 1 -t 1 -d 0
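Since part of my goal is to identify abundant sequences, the cluster file can be summarized by counting the members of each cluster. Below is a minimal Python sketch of that idea, not a tested pipeline. It assumes the usual .clstr layout, where each cluster starts with a ">Cluster N" line, each member line looks like "0	120aa, >seqA... *" for the representative and "1	9aa, >seqB... at 100.00%" for the others. The function name cluster_sizes and the sample IDs are made up for illustration.

```python
import re
from collections import Counter

# Assumed member-line layout in a CD-HIT .clstr file, for example:
#   0	120aa, >seqA... *
#   1	9aa, >seqB... at 100.00%
MEMBER = re.compile(r">(.+?)\.\.\. (\*|at )")


def cluster_sizes(lines):
    """Map each cluster's representative ID to the number of members."""
    sizes = Counter()
    rep, count = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            # A new cluster begins: store the previous one, if any.
            if rep is not None:
                sizes[rep] = count
            rep, count = None, 0
            continue
        match = MEMBER.search(line)
        if match is None:
            continue  # skip blank or unrecognized lines
        count += 1
        if match.group(2) == "*":
            rep = match.group(1)  # the representative is marked with '*'
    if rep is not None:
        sizes[rep] = count  # store the last cluster
    return sizes


if __name__ == "__main__":
    sample = [
        ">Cluster 0",
        "0\t120aa, >seqA... *",
        "1\t9aa, >seqB... at 100.00%",
        ">Cluster 1",
        "0\t50aa, >seqC... *",
    ]
    for rep, n in cluster_sizes(sample).most_common():
        print(rep, n)
```

For a real run, pass an open file handle on the .clstr file instead of the sample list; most_common() then lists the largest clusters first.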

ADD REPLY • link 7.6 years ago by smiller ▴ 70

score 0 · Answer 1 · 2017-12-05

0

7.6 years ago

smiller ▴ 70

Use CD-HIT for this task: the cd-hit program (not cd-hit-dup), run with a sequence identity threshold of 100%.

./cd-hit -i <input fasta> -o <output fasta> -c 1 -t 1 -d 0
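To spot-check the result, one option is to confirm on a small subset of the output that no remaining sequence is identical to, or contained in, another. The Python sketch below is only an illustration of that check. It compares every sequence against all longer ones, so it is quadratic and not meant for the full 3 GB file. The helper names read_fasta and contained_sequences are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def contained_sequences(records):
    """Return headers of sequences that duplicate, or occur inside, another.

    Records are visited from longest to shortest, so each sequence only
    needs to be compared with the sequences already kept.
    """
    ordered = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    kept, flagged = [], []
    for header, seq in ordered:
        if any(seq in longer for longer in kept):
            flagged.append(header)
        else:
            kept.append(seq)
    return flagged


if __name__ == "__main__":
    sample = [
        ">a", "MKTAYIAKQR",
        ">b", "TAYIAK",      # substring of a
        ">c", "GGGG",
    ]
    print(contained_sequences(read_fasta(sample)))
```

An empty list for a sample of the dereplicated file is what one would hope to see; any header it reports is worth looking up in the cluster file.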

ADD COMMENT • link 7.6 years ago by smiller ▴ 70