Finding similar proteins in bacterial genomes
3
0
Entering edit mode
9.1 years ago
prostoesh ▴ 20

Hi!

I have a list of bacterial genomes, and I want to find such common proteins from them, which can be found in ALL organisms (common would mean they match for at least 60%). What's the best way to do it?

I'm using biopython with standalone blast+

P.S. Any suggestions would be great, because I was trying to do with a series of local databases, but it didn't turn up as a success

blast alignment • 2.6k views
ADD COMMENT
1
Entering edit mode
9.1 years ago
5heikki 11k

With cd-hit:

wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Acholeplasma_brassicae_uid222823/NC_022549.faa
wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Acholeplasma_laidlawii_PG_8A_uid58901/NC_010163.faa
wget ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Caldicellulosiruptor_hydrothermalis_108_uid60157/NC_014652.faa

i=1; for f in *.faa; do sed -i '' "s/>/>species_"$i"|/" $f; ((i++)); done
#in place sed syntax above is for max os x
cat *.faa > all.faa

cd-hit -i all.faa -o all3 -c 0.65 -d 0

Then in the resulting all3.clstr file you look for clusters like Cluster 5119 below.

>Cluster 5118
0    73aa, >species_3|gi|651868253|ref|YP_008657540.1|... *
>Cluster 5119
0    72aa, >species_1|gi|162446980|ref|YP_001620112.1|... *
1    72aa, >species_2|gi|312127249|ref|YP_003992123.1|... at 65.28%
2    72aa, >species_3|gi|651867008|ref|YP_008656295.1|... at 81.94%
>Cluster 5120
0    72aa, >species_2|gi|312126476|ref|YP_003991350.1|... *

If you want lower %id then you need to change word size also. Check the manual.

ADD COMMENT
0
Entering edit mode

cd-hit is great, thanks!

can you help me out with output though?

I'm doing command

cd-hit-2d -i db1 -i2 db2 -o db2novel -c 0.6 -n 4

and having an output like

>Cluster 0
0    5203aa, >gi|156742169|ref|... *
1    5166aa, >gi|148655222|ref|... at 84%
>Cluster 1
0    4022aa, >gi|156743632|ref|... *

>Cluster 2
0    3861aa, >gi|156742541|ref|... *

...and so on

this means that protein >gi|156742169|ref| is from db1 and similar to >gi|148655222|ref| from db2 on 84%, right?

and that next two lines (claster 1 and claster 2) contains proteins that doesn't have similaryties?

I'm a little confused with these clusters

also, why when I test this prog on 2 files, each consisting of one fake protein

>gi|148654181|ref|YP_001274394.1| chromosomal replication initiation protein [Roseiflexus sp. RS-1]
AAAAAAAAAB
>gi|148654111|ref|YP_001274394.1| chromosomal replication initiation protein [Roseiflexus sp. RS-1]
AAAAAAAAAA

the result is empyness, and should be 90% match?

ADD REPLY
0
Entering edit mode

You're correct about the output. In my example I used sed to modify all headers so that it would be easier to interpret the clusters with members from different species. I don't know about your fake input, maybe it's due to to short peptides or the fact that B is not a valid letter..

ADD REPLY
0
Entering edit mode

ok, thanks a lot!

ADD REPLY
0
Entering edit mode
9.1 years ago
h.mon 35k

I suppose you already annotated the genomes and have lists of genes?

You may use OMA or OrthoMCL to find the orthologous genes between genomes.

ADD COMMENT
0
Entering edit mode

i tried OMA and desided to go with cd-hit :)

thanks anyway!

ADD REPLY
0
Entering edit mode
8.4 years ago
nepgorkhey ▴ 130

You can use get_homologues software for that. Works fine if you have both amino acid and nucleotide sequences.

ADD COMMENT

Login before adding your answer.

Traffic: 2003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6