Question

How to identify core, accessory and unique sequences from proteomes

0


6.5 years ago

kamel ▴ 70

Hi, I have 8 proteomes in multi-FASTA format, each containing between 13,000 and 14,000 protein sequences. I need to align these sequences to identify the core, accessory and unique sequences across the 8 proteomes. Do you have a method to do this, please?

For information: I used proteinortho, but I noticed that it gives only the core and accessory sequences.

Thank you in advance for your response and your help

alignment sequence genome • 1.3k views

ADD COMMENT • link updated 6.2 years ago by bioinfo17 ▴ 30 • written 6.5 years ago by kamel ▴ 70

0


If you've already got your cores and accessories, you could cluster the proteins to some identity threshold (which presumably you already did with proteinortho), using something like PSI-CD-HIT. I'm not sure how well it scales to a dataset that large, but give it a try.

Any clusters you get with only a single member are your unique proteins.
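To make the single-member idea concrete, here is a minimal Python sketch that reads the text of a CD-HIT `.clstr` cluster file and lists the IDs that sit alone in their cluster. It assumes the usual `.clstr` layout (a `>Cluster N` line followed by one line per member, with the ID between `>` and `...`) and that all proteomes were pooled into one FASTA file before clustering. The function names and the example IDs are made up for illustration.

```python
import re
from typing import Dict, List


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each cluster name to the list of sequence IDs it contains."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Cluster header, e.g. ">Cluster 0"
            current = line[1:].strip()
            clusters[current] = []
        elif current is not None:
            # Member line, e.g. "0\t350aa, >protA_g1... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def singleton_ids(clusters: Dict[str, List[str]]) -> List[str]:
    """Return the IDs of sequences that are the only member of their cluster."""
    return [members[0] for members in clusters.values() if len(members) == 1]


# Hypothetical example content of a .clstr file
EXAMPLE = """>Cluster 0
0\t350aa, >protA_g1... *
1\t348aa, >protA_g2... at 92.50%
>Cluster 1
0\t120aa, >protB_g1... *
"""

if __name__ == "__main__":
    print(singleton_ids(parse_clstr(EXAMPLE)))
```

Note that a protein with a close paralog in the same proteome would land in a multi-member cluster, so you may also want to check which proteome each cluster member comes from.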

ADD REPLY • link 6.5 years ago by Joe 21k

0

Excuse me, but I did not understand what you said. I used proteinortho and got a matrix (.txt file) that does not contain the unique sequences. Do you have a method or tool that aligns the sequences and gives a matrix with the unique, accessory and core sequences?

ADD REPLY • link 6.5 years ago by kamel ▴ 70

0

No, not a single tool; I am not aware of one that does this from proteomes. Your task will probably require some scripting/coding of your own.

My suggestion is to keep the matrix you already have, which gives you two-thirds of what you asked for, and then cluster your sequences using the CD-HIT program to find the unique sequences.

I don't know how else I can explain it...

ADD REPLY • link 6.5 years ago by Joe 21k

score 1 · Answer 1 · 2018-09-06

1


6.2 years ago

bioinfo17 ▴ 30

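Use the -singles option in the proteinortho command. With it, proteins that have no ortholog in any other proteome should also be reported in the output matrix.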
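Once the matrix contains the singletons as well, splitting it into core, accessory and unique rows takes a few lines of Python. The sketch below assumes the matrix is tab-separated, that the first three columns are summary columns (species count, gene count, connectivity), that each remaining column is one proteome, and that `*` marks a proteome with no member in that group. Check the header of your own file before relying on this; the function name and the example rows are made up.

```python
from typing import Dict, List


def classify_orthogroups(tsv_text: str, n_summary_cols: int = 3) -> Dict[str, List[List[str]]]:
    """Split matrix rows into core / accessory / unique by proteome presence."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    header = lines[0].split("\t")
    n_proteomes = len(header) - n_summary_cols  # one column per proteome (assumed)
    groups: Dict[str, List[List[str]]] = {"core": [], "accessory": [], "unique": []}
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # skip any further comment lines
        cells = line.split("\t")[n_summary_cols:n_summary_cols + n_proteomes]
        # A proteome is "present" when its cell holds at least one protein ID
        present = sum(1 for cell in cells if cell.strip() not in ("", "*"))
        if present == n_proteomes:
            groups["core"].append(cells)
        elif present == 1:
            groups["unique"].append(cells)
        elif present > 1:
            groups["accessory"].append(cells)
    return groups


# Hypothetical three-proteome matrix
EXAMPLE = (
    "# Species\tGenes\tAlg.-Conn.\tA.faa\tB.faa\tC.faa\n"
    "3\t3\t1\ta1\tb1\tc1\n"
    "2\t3\t0.5\ta2,a3\t*\tc2\n"
    "1\t1\t0\t*\tb9\t*\n"
)

if __name__ == "__main__":
    for label, rows in classify_orthogroups(EXAMPLE).items():
        print(label, len(rows))
```

Here a group found in a single proteome counts as unique even if it contains several paralogs from that proteome; adjust the rule if you want strictly single-copy proteins.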

ADD COMMENT • link 6.2 years ago by bioinfo17 ▴ 30