I have a list of protein sequences and a list of peptides I want to assign to the protein sequences.
I tried cd-hit-2d (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) for this:
cd-hit-2d -d 0 -i proteins.faa -i2 peptides.faa -o matched_peptides -c 1.0
For some sequences I don't get a match at all where there should be a match:
$ grep -P "pep5\." *clstr
$
However:
>pep5
SVVLLDEVEK
>PROKKA_43260
...APY___SVVLLDEVEK___AHPDVLEMFFQVFDKGLMDDAEGREIDFRNTVIIL
TSNAGSQHIMQACFEKDEELGGAV...
Could this be because the peptide sequence is too short?
For others I noticed that they appear in 1 cluster but have identical matches to multiple proteins:
$ grep -P "pep7\." *clstr -C 5
>Cluster 57774
0 502aa, >PROKKA_167265... *
1 12aa, >pep7... at 100.00%
2 11aa, >pep318... at 100.00%
However it should match more than once:
>pep7
VVNPLGEPIDGK
>PROKKA_167265
...ILGEYKHIEEGFTVKRTGTIFSVPVG
EGMLGR____VVNPLGEPIDGK____GPIQT...
>PROKKA_136748
....VILGEYKHIEEGFTVKRTGTIFSVPVG
EAMLGR____VVNPLGEPIDGK____GPILTDKVRPV...
Is it general cd-hit behavior to assign a sequence only to the first cluster it matches, and is there a way to control or change this?
Overall, is this tool (https://research.bioinformatics.udel.edu/peptidematch/commandlinetool.jsp) better suited for this kind of task? I would also appreciate other suggestions, also allowing a certain number of mismatches.
I'd use cd-hit instead of cd-hit-2d for more than 2 sequences, as explained here. Also, depending on the sequence identity threshold applied, cd-hit will cluster sequences, but it's unlikely that one sequence would end up in multiple clusters. Taken from: https://github.com/weizhongli/cdhit/wiki/1.-Algorithm
Thanks. Since I want each protein sequence to be a representative, I don't think running cd-hit on (I assume) the pooled list of proteins and peptides would help me here.
This partly answers the 2nd question, but I did not find an option to change this default.
The cd-hit algorithm mode can be changed with the -g parameter, as described in the user's manual (-g 0, the default, puts a sequence into the first cluster that meets the threshold; -g 1 puts it into the most similar cluster that meets the threshold).
PS: If you found that the comments and answers have helped, then please upvote and/or accept answers to show your appreciation.
This is more about what happens when a query sequence could fit into multiple clusters with an equal level of similarity (identical to part of the representative); that is also not clear in the parameter description. From what I see, it is only assigned to one.
While I really appreciate you replying to my question, I don't feel that the comments have answered the questions.
As mentioned on the cd-hit wiki page: "CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses". So the short answer to your question is no. As I mentioned in my previous comment, I don't think it is possible within cd-hit to assign a sequence to multiple clusters, maybe because it defeats the purpose of clustering.
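To make the difference concrete, here is a toy sketch contrasting clustering-style assignment (stop at the first hit) with the mapping-style assignment the question asks for (report every hit). It is not cd-hit's implementation; the function names `first_hit` and `all_hits` are made up for illustration, and the sequences are the fragments from the question with the highlighting underscores removed.

```python
# Toy illustration only; this is NOT how cd-hit is implemented.
# Sequences are the fragments quoted in the question (underscores removed).
proteins = {
    "PROKKA_167265": "EGMLGRVVNPLGEPIDGKGPIQT",
    "PROKKA_136748": "EAMLGRVVNPLGEPIDGKGPILTDKVRPV",
}
peptides = {"pep7": "VVNPLGEPIDGK"}


def first_hit(peptide, proteins):
    """Clustering-style: return the first protein that contains the peptide."""
    for name, seq in proteins.items():
        if peptide in seq:
            return name
    return None


def all_hits(peptide, proteins):
    """Mapping-style: return every protein that contains the peptide."""
    return [name for name, seq in proteins.items() if peptide in seq]


for pep_id, pep in peptides.items():
    print(pep_id, "first hit:", first_hit(pep, proteins))
    print(pep_id, "all hits: ", all_hits(pep, proteins))
```

With the first strategy pep7 is reported once; with the second it is reported for both proteins, which is what a peptide-to-protein assignment needs.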
Maybe you should give peptidematch a shot and see if it gives you the desired output.
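If a dedicated tool turns out not to be needed, a brute-force scan may already be enough for modest data sets. The following is a minimal sketch, assuming the sequences are already loaded into dictionaries; `find_matches`, `map_peptides` and `max_mismatches` are hypothetical names, not part of cd-hit or peptidematch. It counts substitutions only (no gaps) and is quadratic, so it will be slow for large proteomes.

```python
def find_matches(peptide, protein, max_mismatches=0):
    """Return (0-based start, mismatch count) for every window of the protein
    that differs from the peptide at no more than max_mismatches positions."""
    hits = []
    k = len(peptide)
    for start in range(len(protein) - k + 1):
        window = protein[start:start + k]
        # Hamming distance between the peptide and this window
        mismatches = sum(a != b for a, b in zip(peptide, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def map_peptides(peptides, proteins, max_mismatches=0):
    """Map each peptide ID to a list of (protein ID, start, mismatches)."""
    return {
        pep_id: [
            (prot_id, start, mm)
            for prot_id, seq in proteins.items()
            for start, mm in find_matches(pep, seq, max_mismatches)
        ]
        for pep_id, pep in peptides.items()
    }


# Fragments quoted in the question, with the highlighting underscores removed
proteins = {
    "PROKKA_43260": "APYSVVLLDEVEKAHPDVLEMFFQVFDKGLMDDAEGREIDFRNTVIIL",
    "PROKKA_167265": "EGMLGRVVNPLGEPIDGKGPIQT",
    "PROKKA_136748": "EAMLGRVVNPLGEPIDGKGPILTDKVRPV",
}
peptides = {"pep5": "SVVLLDEVEK", "pep7": "VVNPLGEPIDGK"}

for pep_id, hits in map_peptides(peptides, proteins).items():
    print(pep_id, hits)
```

Every peptide is compared against every protein, so a peptide that occurs in several proteins is reported for each of them, and raising `max_mismatches` relaxes the exact-match requirement.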
PS: Thanks for appreciating the efforts. :)