Question

CD-hit-2d for matching peptides

7.6 years ago

malteherold ▴ 60

I have a list of protein sequences and a list of peptides that I want to assign to the protein sequences. I tried cd-hit-2d (http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide) for this:

$ cd-hit-2d -d 0 -i proteins.faa -i2 peptides.faa -o matched_peptides -c 1.0

For some sequences I don't get a match at all where there should be a match:

$ grep -P "pep5\." *clstr
$

However:

>pep5
SVVLLDEVEK

>PROKKA_43260
...APY___SVVLLDEVEK___AHPDVLEMFFQVFDKGLMDDAEGREIDFRNTVIIL
TSNAGSQHIMQACFEKDEELGGAV...

Could this be because the peptide sequence is too short?

For others I noticed that they appear in 1 cluster but have identical matches to multiple proteins:

$ grep -P "pep7\." *clstr -C 5

>Cluster 57774
0   502aa, >PROKKA_167265... *
1   12aa, >pep7... at 100.00%
2   11aa, >pep318... at 100.00%

However it should match more than once:

>pep7
VVNPLGEPIDGK

>PROKKA_167265
...ILGEYKHIEEGFTVKRTGTIFSVPVG
EGMLGR____VVNPLGEPIDGK____GPIQT...


>PROKKA_136748
....VILGEYKHIEEGFTVKRTGTIFSVPVG
EAMLGR____VVNPLGEPIDGK____GPILTDKVRPV...

Is this a general behavior of cd-hit to assign only the first match to a cluster and is there a way to control or change this?
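For checking the .clstr output more systematically than with grep, the cluster listing shown above can be turned into a peptide-to-representative lookup. This is only a sketch: `parse_clstr` is a hypothetical helper name, and it assumes every member line has the `>name...` layout seen in the excerpt, with the representative marked by a trailing `*`.

```python
import re


def parse_clstr(lines):
    """Map each non-representative member to its cluster representative."""
    clusters = []  # one [representative, [members]] entry per cluster
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append([None, []])
            continue
        match = re.search(r">(.+?)\.\.\.", line)
        if not match or not clusters:
            continue  # skip blank or unexpected lines
        name = match.group(1)
        if line.endswith("*"):
            clusters[-1][0] = name  # representative of the current cluster
        else:
            clusters[-1][1].append(name)
    return {m: rep for rep, members in clusters for m in members}


# Lines copied from the cluster excerpt in the question.
clstr_lines = [
    ">Cluster 57774",
    "0   502aa, >PROKKA_167265... *",
    "1   12aa, >pep7... at 100.00%",
    "2   11aa, >pep318... at 100.00%",
]
mapping = parse_clstr(clstr_lines)

# Peptides that never appear as a cluster member were not assigned at all.
all_peptides = {"pep5", "pep7", "pep318"}
unassigned = sorted(all_peptides - set(mapping))
```

Passing an open file handle instead of the list should work the same way, since the function only iterates over lines.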

Overall, is this tool (https://research.bioinformatics.udel.edu/peptidematch/commandlinetool.jsp) better suited for this kind of task? I would also appreciate other suggestions, also allowing a certain number of mismatches.

cd-hit peptides protein peptidematch • 3.6k views

ADD COMMENT • link 7.6 years ago by malteherold ▴ 60

I'd use cd-hit instead of cd-hit-2d for more than 2 sequences, as explained here. Also, depending on the sequence identity threshold applied, cd-hit will cluster the sequences, but it's unlikely that one sequence would end up in multiple clusters. Taken from: https://github.com/weizhongli/cdhit/wiki/1.-Algorithm

CD-HIT is a greedy incremental clustering approach. The basic CD-HIT algorithm sorts the input sequences from long to short, and processes them sequentially from the longest to the shortest. The first sequence is automatically classified as the first cluster representative sequence. Then each query sequence of the remaining sequences is compared to the representative sequences found before it, and is classified as redundant or representative based on whether it is similar to one of the existing representative sequences. In default manner (fast mode), a query is grouped into the first representative without comparing to other representatives. In accurate mode, a query is compared to all representatives and grouped to the most similar one.
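The greedy procedure quoted above can be sketched in a few lines. This is a toy illustration, not CD-HIT's actual implementation: the names `greedy_cluster`, `contains` and `overlap` are hypothetical, substring containment stands in for the real identity threshold, and the two protein strings are short fragments taken from the excerpts in the question.

```python
def greedy_cluster(sequences, is_similar, score=None, accurate=False):
    """Toy greedy incremental clustering over a dict of name -> sequence."""
    # Process sequences from longest to shortest (ties keep input order).
    order = sorted(sequences, key=lambda name: -len(sequences[name]))
    clusters = {}  # representative name -> list of member names
    for name in order:
        query = sequences[name]
        hits = []
        for rep in clusters:
            if is_similar(query, sequences[rep]):
                hits.append(rep)
                if not accurate:
                    break  # fast mode: stop at the first matching representative
        if not hits:
            clusters[name] = []  # nothing similar yet: new representative
        elif accurate:
            # Accurate mode: pick the best-scoring representative among all hits.
            best = max(hits, key=lambda rep: score(query, sequences[rep]))
            clusters[best].append(name)
        else:
            clusters[hits[0]].append(name)
    return clusters


def contains(query, rep):
    """Toy similarity: the query occurs exactly inside the representative."""
    return query in rep


def overlap(query, rep):
    """Toy score: number of matched residues (0 if not contained)."""
    return len(query) if query in rep else 0


seqs = {
    "protA": "EGMLGRVVNPLGEPIDGKGPIQT",        # fragment, toy example
    "protB": "EAMLGRVVNPLGEPIDGKGPILTDKVRPV",  # fragment, toy example
    "pep7": "VVNPLGEPIDGK",
}
fast = greedy_cluster(seqs, contains)
accurate = greedy_cluster(seqs, contains, score=overlap, accurate=True)
```

In this toy both proteins remain representatives, and the peptide is listed under only one of them in either mode, because each query is placed in exactly one cluster even when several representatives match equally well.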

ADD REPLY • link 7.6 years ago by Sej Modha 5.3k

Thanks. Since I want each protein sequence to be a representative, I don't think running cd-hit on the pooled list of proteins and peptides would help me here.

In default manner (fast mode), a query is grouped into the first representative without comparing to other representatives. In accurate mode, a query is compared to all representatives and grouped to the most similar one.

This partly answers the 2nd question, but I did not find an option to change this default.

ADD REPLY • link 7.6 years ago by malteherold ▴ 60

The cd-hit algorithm mode can be changed with the -g parameter, as described in the user's manual:

-g    1 or 0, default 0. By cd-hit's default algorithm, a sequence is clustered to the first cluster that meets the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meets the threshold (accurate but slow mode), but either 1 or 0 won't change the representatives of the final clusters.

PS: If you found that the comments and answers have helped then please upvote and/or accept answers to show the appreciation.

ADD REPLY • link 7.6 years ago by Sej Modha 5.3k

My question is more about what happens when a query sequence could fit into multiple clusters with an equal level of similarity (identical to part of each representative); that is also not clear from the parameter description. From what I see, it is only assigned to one.

While I really appreciate you replying to my question, I don't feel that the comments have answered the questions.

ADD REPLY • link 7.6 years ago by malteherold ▴ 60

Is this a general behavior of cd-hit to assign only the first match to a cluster and is there a way to control or change this?

As mentioned on the cd-hit wiki page: "CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses". So the short answer to your question is no. As I mentioned in my previous comment, I don't think it is possible within cd-hit to assign a sequence to multiple clusters, maybe because that would defeat the purpose of clustering.

Overall, is this tool (https://research.bioinformatics.udel.edu/peptidematch/commandlinetool.jsp) better suited for this kind of task? I would also appreciate other suggestions, also allowing a certain number of mismatches.

Maybe you should give peptidematch a shot and see if it gives you the desired output.
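If all that is needed is a list of every protein that contains each peptide, a direct scan may be enough as an alternative. The following is a minimal sketch that assumes both FASTA files fit in memory; `read_fasta` and `find_matches` are hypothetical helper names, mismatches are counted as substitutions only (no gaps), and the brute-force scan has not been tuned for large proteomes. The example sequences are short fragments from the question.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict of name -> sequence."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line  # join wrapped sequence lines
    return records


def find_matches(peptides, proteins, max_mismatches=0):
    """Return (peptide, protein, 1-based start, mismatches) for every hit."""
    hits = []
    for pep_name, pep in peptides.items():
        for prot_name, prot in proteins.items():
            # Slide the peptide along the protein, one position at a time.
            for start in range(len(prot) - len(pep) + 1):
                window = prot[start:start + len(pep)]
                mismatches = sum(a != b for a, b in zip(pep, window))
                if mismatches <= max_mismatches:
                    hits.append((pep_name, prot_name, start + 1, mismatches))
    return hits


proteins = read_fasta([
    ">protA toy fragment",
    "EGMLGRVVNPLGEPIDGKGPIQT",
    ">protB toy fragment",
    "EAMLGRVVNPLGEPIDGK",
    "GPILTDKVRPV",
])
peptides = {"pep7": "VVNPLGEPIDGK"}

exact_hits = find_matches(peptides, proteins)
fuzzy_hits = find_matches({"pepX": "GPIQT"}, proteins, max_mismatches=1)
```

Unlike a clustering run, this reports a peptide once per matching protein and position, so a peptide shared by several proteins shows up several times.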

PS: Thanks for appreciating the efforts. :)

ADD REPLY • link 7.6 years ago by Sej Modha 5.3k