6.5 years ago
kamel
Hi, I have 8 proteomes in multi-FASTA format, each containing between 13,000 and 14,000 protein sequences. I need to align these sequences to identify the core, accessory, and unique sequences among the 8 proteomes. Do you have a method to do this, please?
For information: I used proteinortho, but I noticed that it reports only the core and accessory sequences.
Thank you in advance for your response and your help.
If you've already got your cores and accessories, you could cluster the proteins to some identity threshold (which presumably you already did with proteinortho), using something like PSI-CD-HIT. I'm not sure how well it scales to a dataset that large, but give it a try. Any clusters you get with only a single member are your unique proteins.
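To make the "single-member cluster" idea concrete, here is a minimal Python sketch that reads a CD-HIT style `.clstr` file and lists the clusters with exactly one member. It assumes the usual `.clstr` layout (a `>Cluster N` line followed by one line per member, with the sequence ID between `>` and `...`). The function names and the example IDs are made up for illustration.

```python
import re
from collections import defaultdict

def parse_clstr(lines):
    """Group sequence IDs by cluster from CD-HIT style .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        else:
            # Member lines look like: "0\t350aa, >some_id... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)

def singletons(clusters):
    """Return the IDs of sequences that are alone in their cluster."""
    return sorted(m[0] for m in clusters.values() if len(m) == 1)

# Hypothetical example data, not real output.
example = [
    ">Cluster 0",
    "0\t350aa, >sp1_prot1... *",
    "1\t348aa, >sp2_prot7... at 92.50%",
    ">Cluster 1",
    "0\t120aa, >sp3_prot42... *",
]

if __name__ == "__main__":
    print(singletons(parse_clstr(example)))
```

For a real run you would pass an open file handle instead of the `example` list, for instance `parse_clstr(open("all_proteomes.clstr"))`, where the file name is a placeholder. Note that a cluster whose members all come from the same proteome (paralogs) is also specific to that proteome, so you may want to check for that case as well.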
Excuse me, but I did not understand what you said. I used proteinortho and got a matrix (.txt file) that does not contain the unique sequences. Do you have a method or tool that aligns the sequences and gives a matrix with the unique, accessory, and core sequences?
Not a single tool, no. I am not aware of one that works from proteomes. Your task will probably require some scripting/coding of your own.
My suggestion is to keep the matrix you already have, which gives you 2/3rds of what you asked for, and then cluster your sequences using the CD-HIT program to find unique sequences. I don't know how else I can explain it...
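As a rough illustration of the "scripting of your own" part, the sketch below splits the rows of an orthology matrix into core and accessory groups and then lists every protein ID that appears in no row. This is a different route to the unique set than a CD-HIT run: it treats "absent from the matrix" as "unique", which is an assumption you should verify on your data. It also assumes a tab-separated table with three leading metadata columns, one column per proteome, `*` for "absent", and commas between co-orthologs. Check this against your actual .txt file. All names and IDs here are hypothetical.

```python
def classify_matrix(lines, n_meta=3):
    """Split orthology-matrix rows into core and accessory groups.

    n_meta is the number of leading non-proteome columns (assumed to be 3).
    Returns (core, accessory, seen_ids).
    """
    core, accessory, seen = [], [], set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cells = line.split("\t")[n_meta:]
        present = [c for c in cells if c != "*"]
        ids = [gene for c in present for gene in c.split(",")]
        seen.update(ids)
        # Core = found in every proteome; otherwise accessory.
        if len(present) == len(cells):
            core.append(ids)
        else:
            accessory.append(ids)
    return core, accessory, seen

def unique_ids(all_ids, seen):
    """IDs present in the proteomes but absent from every matrix row."""
    return sorted(set(all_ids) - seen)

# Hypothetical three-proteome example, not real output.
matrix = [
    "# Species\tGenes\tAlg.-Conn.\tA.faa\tB.faa\tC.faa",
    "3\t3\t1\ta1\tb1\tc1",
    "2\t3\t0.5\ta2,a3\t*\tc2",
]
all_proteins = ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "c2"]

if __name__ == "__main__":
    core, accessory, seen = classify_matrix(matrix)
    print(len(core), len(accessory), unique_ids(all_proteins, seen))
```

In practice `all_proteins` would be collected from the header lines of your 8 FASTA files, and the candidates returned by `unique_ids` could then be cross-checked against the singleton clusters from CD-HIT.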