Question

Finding similar proteins in bacterial genomes

0

Entering edit mode

9.1 years ago

prostoesh ▴ 20

Hi!

I have a list of bacterial genomes, and I want to find such common proteins from them, which can be found in ALL organisms (common would mean they match for at least 60%). What's the best way to do it?

I'm using biopython with standalone blast+

P.S. Any suggestions would be great, because I was trying to do with a series of local databases, but it didn't turn up as a success

blast alignment • 2.6k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by prostoesh ▴ 20

Ram · Answer 1 · 2015-10-05

1

Entering edit mode

9.1 years ago

5heikki 11k

With cd-hit:

wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Acholeplasma_brassicae_uid222823/NC_022549.faa
wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Acholeplasma_laidlawii_PG_8A_uid58901/NC_010163.faa
wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Caldicellulosiruptor_hydrothermalis_108_uid60157/NC_014652.faa

i=1; for f in *.faa; do sed -i '' "s/>/>species_"$i"|/" $f; ((i++)); done
#in place sed syntax above is for max os x
cat *.faa > all.faa

cd-hit -i all.faa -o all3 -c 0.65 -d 0

Then in the resulting all3.clstr file you look for clusters like Cluster 5119 below.

>Cluster 5118
0    73aa, >species_3|gi|651868253|ref|YP_008657540.1|... *
>Cluster 5119
0    72aa, >species_1|gi|162446980|ref|YP_001620112.1|... *
1    72aa, >species_2|gi|312127249|ref|YP_003992123.1|... at 65.28%
2    72aa, >species_3|gi|651867008|ref|YP_008656295.1|... at 81.94%
>Cluster 5120
0    72aa, >species_2|gi|312126476|ref|YP_003991350.1|... *

If you want lower %id then you need to change word size also. Check the manual.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.1 years ago by 5heikki 11k

0

Entering edit mode

cd-hit is great, thanks!

can you help me out with output though?

I'm doing command

cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.6 -n 4

and having an output like

>Cluster 0
0    5203aa, >gi|156742169|ref|... *
1    5166aa, >gi|148655222|ref|... at 84%
>Cluster 1
0    4022aa, >gi|156743632|ref|... *

>Cluster 2
0    3861aa, >gi|156742541|ref|... *

...and so on

this means that protein >gi|156742169|ref| is from db1 and similar to >gi|148655222|ref| from db2 on 84%, right?

and that next two lines (claster 1 and claster 2) contains proteins that doesn't have similaryties?

I'm a little confused with these clusters

also, why when I test this prog on 2 files, each consisting of one fake protein

>gi|148654181|ref|YP_001274394.1| chromosomal replication initiation protein [Roseiflexus sp. RS-1]
AAAAAAAAAB

>gi|148654111|ref|YP_001274394.1| chromosomal replication initiation protein [Roseiflexus sp. RS-1]
AAAAAAAAAA

the result is empyness, and should be 90% match?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by prostoesh ▴ 20

0

Entering edit mode

You're correct about the output. In my example I used sed to modify all headers so that it would be easier to interpret the clusters with members from different species. I don't know about your fake input, maybe it's due to to short peptides or the fact that B is not a valid letter..

ADD REPLY • link 9.1 years ago by 5heikki 11k

0

Entering edit mode

ok, thanks a lot!

ADD REPLY • link 9.1 years ago by prostoesh ▴ 20

score 0 · Answer 2 · 2015-10-05

0

Entering edit mode

9.1 years ago

h.mon 35k

I suppose you already annotated the genomes and have lists of genes?

You may use OMA or OrthoMCL to find the orthologous genes between genomes.

ADD COMMENT • link 9.1 years ago by h.mon 35k

0

Entering edit mode

i tried OMA and desided to go with cd-hit :)

thanks anyway!

ADD REPLY • link 9.1 years ago by prostoesh ▴ 20

score 0 · Answer 3 · 2016-06-23

0

Entering edit mode

8.4 years ago

nepgorkhey ▴ 130

You can use get_homologues software for that. Works fine if you have both amino acid and nucleotide sequences.

ADD COMMENT • link 8.4 years ago by nepgorkhey ▴ 130