Hi there fellas
I am here again. I am trying to cluster a fasta file with a bunch of 5mers (peptides). I download and installed (conda) the cd-hit software.
The file is something like this (that is the initial 10 seqs from the real file):
>0
AAAAA
>1
AAAAC
>2
AAAAD
>3
AAAAE
>4
AAAAF
>5
AAAAG
>6
AAAAH
>7
AAAAI
>8
AAAAK
>9
AAAAL
>10
AAAAM
I using the basics commands like suggested in: https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT Command:
$ cd-hit -i all_poss_1000.txt -c 0.9 -o all_poss_1000_clustered -n 5
However I got this output:
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Oct 26 2019, 14:51:47
Command: cd-hit -i all_poss_1000.txt -c 0.9 -o
all_poss_1000_clustered -n 5
Started: Tue Jan 28 11:09:21 2020
================================================================
Output
----------------------------------------------------------------
total seq: 0
longest and shortest : 0 and 18446744073709551615
Total letters: 0
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 0M
Buffer : 1 X 10M = 10M
Table : 1 X 65M = 65M
Miscellaneous : 0M
Total : 75M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 90520634
0 finished 0 clusters
Approximated maximum memory consumption: 75M
writing new database
writing clustering information
program completed !
Total CPU time 0.05
Question: what I am doing wrong???
My input? I divide the full file in files with 1000 entries.
Thank you for your time and help.
Paulo
PS- I will need to read a little bit about the subject! But I really appreciate you help and tips guys! Thank you very much...see ya soon probably 8)
Could you post an extract of the real file?
Did you process the file in any other environment than linux? If so there might be hidden characters in it. Moreover, i don't think that a word size (kmer) of 5 makes sense here as all sequences have a different 5mer so none of them will "merge"
Hi lieven, thats the first 10 sequences form my input file! 8)
alright, I thought it was a dummy example. my bad
I think the other answers here are worth experimenting with, and ordinarily I'd advise the exact opposite of what I'm about to suggest, but I think, for this, since the data looks reasonably simple, you don't need something as heavyweight as
CD-HIT
.If all of your 5mers are similar to the data you posted, they're essentially already 'aligned' so there's no need to do any alignment step, and instead we can just treat it as a string similarity problem.
Very simply, you can do all-vs-all comparisons and compute a Levenshtein/Hamming/Cosine distance between the 'words' and assign them to clusters based on similarity thresholds (something using a scoring function and
itertools.combinations
in python would work I'd expect.Altenatively, you could go down a slightly more heuristic approach with something like: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups
The only issue I can think with this approach is whether you're considering 'similarity' to mean the biological definition of amino acid similarity, in which case it is a bit more complicated than just word matching.
Hi Joe
I have done something like this:
(You get 20**5 kmers.)
To agroup the mer according with the aac composition. I think maybe it can become easier to find out if a mer or group of mers belongs to a class with some important functional motif in a protein. It agroupd mers like that abcd, a2bcd as keys for the mers AAAAA, AAAAB, etc as values.
I am not still thinking in any evolutionary classification.
But I was thinking in one approach like using some python module to try a clustering as you mention.
Again I REALLY appreciate you help and attention, you and others here are always very helping.
Thank you guys some much.
Paulo
What I mean is, in your example data, all of those sequences are 1 residue different from one another. In my understanding, this makes them all equivalent as a cluster with a 'distance' of 1. Is this what you are looking for, or do you need some more meaningful clustering whereby amino acid properties are conserved (.e.g., should
AAAAR
cluster with onlyAAAAH
and andAAAAK
by virtue of all having positively charged side chains)?Otherwise all of
AAAAx
should be a cluster wheren
is anything in your alphabetACDEFGHIKLMNPQRSTVWY
.I'm just not clear yet on what you want the clusters to end up like, and what the demarcation between clusters is supposed to be (identity? similarity? does position specificity matter? i.e. is
RAAAA
similar toAAAAR
.)Hi Joe
my intentions are to cluster all motifs (20**5) in groups of shared/similar aac compostions, i.e. those that are permutations of a single sequence. there would have different groups of kmers like: a1b1c1d1e1 , a1b1c1d2 , a1b2c2 .... a5 . Each cluster would have to include compositions as a1b4, b1a4, a1c4 , etc. So, each cluster would include all kmers that are permutations of a given aac composition, e.g. abbbb, babbb, etc.
Sorry english is not my first language and some times it is hard to explain doubts in a meaningful way!
Hope now it gets a little more clear!
PS- I try many of the suggestions but didn't get any results with cd-hit! Maybe if I try change my code adding as you suggest a distance measure I could be done in python (which I would like very much).
Not sure now if you are still looking for cd-hit to work. But if I use your example I can cluster them. What command are you using...? This works
cd-hit -i yourinput -c 0.8 -o cdhitout -n 2 -l 2
It is in my first question. I think i tried -c 0.8 but not -n2.
I will check gb.
Thank you
I will try to use a different file.
Now I got it with cd-hit.
I can only recreate your error by giving an input file that does not exist
I think i give it a much bigger file ;)
that
-n
need to be smaller then 5 in your case. (and maybe even smaller then 4 depending on the lvl of similairity you want)This option mainly exist to improve speed but your case is a little more unique. It helps to make an "index" for faster clustering. I have no time now to explain it in detail, blast has a similar mechanism maybe you can learn from that if you really want to know. (you could search megablast vs blastn).Also maybe increasing the memory limit helps. Try
-M 0
Thank you gb.
I will check out your suggestions to understand the theory with blast.
Thanks for your time and patience. I am very grateful for all you guys have doing for my learning!
Yeah I don't think CD-HIT is going to work for that (not least because it does care about where in the sequence they differ (e.g. its unlikely to cluster
AAAAB
withBAAAA
).I think a brute force string searching approach is your only real option. Its easy enough to get string compositions with
collections.Counter
.I don't imagine this is going to be quick on a file that large though.