Question

How to find three consecutive orthologous genes in 800 bacterial genomes?

0

5.8 years ago

natasha.sernova ★ 4.0k

Dear all, I am afraid this particular question has been asked several times, but I failed to find any of the previous posts. I have 800 bacterial genomes. I know that some of these genomes may have the group of three consecutive genes with any probable insertions of some foreign genes. Sometimes one or two genes from such a group are lost – it does not matter, I need the rest left. What is the easiest way to find orthologs of these genes across all 800 genomes? I am not sure that three simple alignments, each of a single gene sequence against all 800 genomes, will help. (I read such a discussion some time ago, but I have not been able to find it again.) I also do not know of good software for this. I hope there is a better way that I have forgotten about. Thank you very much! Sincerely, Natasha

genome bacteria software alignment • 2.2k views

ADD COMMENT • link updated 5.8 years ago by Mensur Dlakic ★ 29k • written 5.8 years ago by natasha.sernova ★ 4.0k

2

This would not be a simple thing since you admit that

some of these genomes may have the group of three consecutive genes with any probable insertions of some foreign genes.

I would suggest that you use the three genes independently to locate their homologs in 800 genomes and then try to reconcile the results to see if they are within a certain distance and/or present in the order you expect.

Ortholog-finder may also come in handy.

ADD REPLY • link 5.8 years ago by GenoMax 152k
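
The reconciliation step suggested above (search each gene separately, then check distance and order) could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a tested pipeline: it assumes the search hits were already parsed into dictionaries with the hypothetical keys `genome`, `contig`, `query`, `start` and `end` (with `start` <= `end`), that the three queries are named `gene1`, `gene2` and `gene3`, and that 5000 bp is a reasonable `max_gap`. None of these names or values come from the thread.

```python
from collections import defaultdict

# Hypothetical query names, listed in the expected gene order.
EXPECTED = ("gene1", "gene2", "gene3")


def group_hits(hits, max_gap=5000):
    """Group hits on the same contig whose neighbours lie within max_gap bp.

    Returns a list of ((genome, contig), [hit, ...]) tuples, with the hits
    of each cluster sorted by start coordinate.
    """
    by_contig = defaultdict(list)
    for hit in hits:
        by_contig[(hit["genome"], hit["contig"])].append(hit)

    clusters = []
    for key, contig_hits in by_contig.items():
        contig_hits.sort(key=lambda h: h["start"])
        current = [contig_hits[0]]
        for hit in contig_hits[1:]:
            if hit["start"] - current[-1]["end"] <= max_gap:
                current.append(hit)          # close enough: same cluster
            else:
                clusters.append((key, current))
                current = [hit]              # too far: start a new cluster
        clusters.append((key, current))
    return clusters


def order_label(cluster):
    """Classify the gene order of one cluster.

    'forward' means a subsequence of gene1, gene2, gene3; 'reverse' means a
    subsequence of gene3, gene2, gene1; 'single' means only one hit;
    anything else (including duplicated queries) is 'scrambled'.
    """
    ranks = [EXPECTED.index(hit["query"]) for hit in cluster]
    if len(ranks) == 1:
        return "single"
    if all(a < b for a, b in zip(ranks, ranks[1:])):
        return "forward"
    if all(a > b for a, b in zip(ranks, ranks[1:])):
        return "reverse"
    return "scrambled"
```

Calling `group_hits(hits)` and then `order_label(cluster)` on each returned cluster gives one label per neighbourhood. A cluster with a missing gene (for example gene1 followed by gene3) still counts as `forward`, which matches the requirement that a lost gene should not hide the rest of the group.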

1

Before I really chip in here, can I ask for clarification of the following:

Sometimes one or two genes from such a group are lost – it does not matter, I need the rest left.

Do I understand correctly that, from your group of three, up to two can be lost (so only one of the three remains)? How would you then detect that the remaining one was once part of that group?

I'm asking because I might have an approach, but it has a lower limit of three (e.g. in a group of four, one can be lost); with fewer genes it becomes less feasible or even impossible.

ADD REPLY • link 5.8 years ago by lieven.sterck 15k

0

Unfortunately, it's possible. A situation like gene1-insertion-gene3 is common, and so is any single gene left out of the three: gene1-insertion1-insertion2, insertion1-gene2-insertion2, or insertion1-insertion2-gene3. I will have to check these three genes separately, as @genomax suggested. Oh, and measure the distance between the genes in the gene1-insertion-gene3 case. But how can I make this easy? Will Ortholog-finder help with this task?

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

0

I understand, but what I want to say is that if only gene1 is left, you cannot determine, without additional evidence, whether it was once part of the group or has always been a single gene (with the other two never there). Are the inserted ones 'conserved'?

Is the order important, by the way? Is it always g1 g2 g3, or can it be g2 g1 g3, ...?

ADD REPLY • link 5.8 years ago by lieven.sterck 15k

0

The order is strongly conserved. It depends only on the strand: it's either g1 g2 g3 or g3 g2 g1. Actually, I was wrong: I don't have insertions. Instead, any of the three genes may simply be replaced by some 'hypothetical' gene that is not orthologous to the gene it replaces.

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

0

Dear all, many thanks for your answers, all of them are really helpful!

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

score 2 · Answer 1 · 2019-10-04

2

5.8 years ago

Mensur Dlakic ★ 29k

I can think of two options, and will list them by increasing time investment.

1) Submit your proteins, one at a time, to STRING. This will automatically create gene neighborhood plots for all genomes it has, and that should give you a pretty good idea how frequently these proteins are found next to each other. I know that is not the same as interrogating your 800 genomes, but it is likely that most of them will already be included in STRING.

2) A variation of what was already suggested: get proteomes for all your species, concatenate them into a single file, and search your proteins of interest individually against this database. Post-processing of the three outputs would involve extracting GI numbers for the matches and, after combining the three outputs, counting how many times you have 3 consecutive GI numbers. If you want to allow for insertions, you can stipulate that the difference between the smallest and largest GI number can be up to 4 instead of 2, which would allow for 2 inserted genes. By the way, this can be done at the DNA level as well by concatenating .ffn instead of .faa files.

ADD COMMENT • link 5.8 years ago by Mensur Dlakic ★ 29k
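
The post-processing in option 2 could be sketched in Python along these lines. This is only an illustration under assumptions that are not in the answer: instead of GI numbers it uses an ordinal gene index (the position of each gene in its genome, as an integer), and it expects the three search outputs to be already parsed into a hypothetical `hit_lists` dictionary of `(genome, gene_index)` pairs. A `max_span` of 2 corresponds to three strictly consecutive genes, and 4 allows up to two inserted genes, as described above.

```python
from collections import defaultdict


def best_neighbourhoods(hit_lists, max_span=2):
    """Find, per genome, the window of hits covering the most distinct queries.

    hit_lists: {query_name: [(genome, gene_index), ...]}  (hypothetical layout)
    max_span:  largest allowed difference between the smallest and the
               largest gene index inside one window.
    Returns {genome: [(gene_index, query_name), ...]}.
    """
    per_genome = defaultdict(list)
    for query, hits in hit_lists.items():
        for genome, index in hits:
            per_genome[genome].append((index, query))

    result = {}
    for genome, items in per_genome.items():
        items.sort()
        best = []
        low = 0
        # Sliding window over the hits, sorted by gene index.
        for high in range(len(items)):
            while items[high][0] - items[low][0] > max_span:
                low += 1
            window = items[low:high + 1]
            if len({q for _, q in window}) > len({q for _, q in best}):
                best = window
        result[genome] = best
    return result


def count_complete(result, n_queries=3):
    """Count genomes whose best window contains all n_queries genes."""
    return sum(1 for window in result.values()
               if len({q for _, q in window}) == n_queries)
```

Genomes whose best window holds only one or two of the queries are still reported, so partially lost groups remain visible. The sketch ignores contig boundaries and the wrap-around of circular chromosomes, so indices from different replicons would have to be kept apart before using it.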

0

A partial problem is that I have only *.gb-format files. The gbk format disappeared in 2013-14. *.gb files are human-readable text files, but I don't know how to convert them to any other format. It's 'almost' the previous gbk format, but only 'almost'...

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

1

Try seqret from EMBOSS to convert the files.

ADD REPLY • link 5.8 years ago by GenoMax 152k

0

Many thanks! *.gb implies genbank, I think.

http://emboss.sourceforge.net/docs/faq.html

Q) What sequence formats are supported?

A) Many:

gcg, embl, swissprot, fasta, ncbi, genbank, nbrf, codata, strider, clustal, phylip, acedb, msf, ig, staden, text, raw, asis

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

1

sreformat from the old HMMer package (v2.3-ish) can convert GenBank files to FASTA and other formats.

ADD REPLY • link 5.8 years ago by Mensur Dlakic ★ 29k

0

Thank you very much!

But can't any newer HMMer package do this?

ADD REPLY • link 5.8 years ago by natasha.sernova ★ 4.0k

1

It seems that the equivalent program in current HMMer (esl-reformat) can't do GenBank conversion. HMMer keeps an archive of old versions, and I think that v2.3.2 will work. sreformat is an auxiliary program that can be found in the squid directory after compilation.

ADD REPLY • link 5.8 years ago by Mensur Dlakic ★ 29k
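
Besides the converters named in this thread, the protein sequences can also be pulled out of GenBank flat files with a short script. The following is a minimal Python sketch using only the standard library, and it rests on several assumptions: the file follows the usual flat-file layout (feature keys starting in column 6, qualifiers in column 22), each CDS of interest carries a `/translation` qualifier, and the function name `genbank_cds_to_fasta` and the `genome|index|locus_tag` header layout are made up for illustration. Biopython's `SeqIO` module is a more robust way to parse GenBank files if it is available.

```python
import re


def genbank_cds_to_fasta(gb_text, genome_name):
    """Return FASTA text with one entry per CDS that has a /translation.

    Headers look like >genome|index|locus_tag, where index is the ordinal
    position of the CDS in the file (a hypothetical convention).
    """
    cds_list = []        # finished CDS features, in file order
    current = None       # qualifiers of the CDS being read, or None
    last_key = None      # qualifier that a continuation line belongs to
    in_features = False

    for line in gb_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if not line.startswith(" "):
            # ORIGIN, CONTIG or // ends the feature table of this record.
            in_features = False
            if current is not None:
                cds_list.append(current)
            current = None
            continue
        key, body = line[5:21].strip(), line[21:].strip()
        if key:
            # A new feature starts; close the previous CDS, if any.
            if current is not None:
                cds_list.append(current)
            current = {} if key == "CDS" else None
            last_key = None
        elif current is not None:
            match = re.match(r"/(\w+)=(.*)", body)
            if match:
                last_key = match.group(1)
                current[last_key] = match.group(2).strip('"')
            elif body.startswith("/"):
                last_key = None  # value-less qualifier such as /pseudo
            elif last_key is not None:
                # Continuation line; joined without spaces, which is right
                # for /translation but not for free-text qualifiers.
                current[last_key] += body.strip('"')
    if current is not None:
        cds_list.append(current)

    lines = []
    for index, cds in enumerate(cds_list, start=1):
        if "translation" not in cds:
            continue  # e.g. pseudogenes; the index still counts them
        name = cds.get("locus_tag") or cds.get("protein_id") or "unnamed"
        lines.append(f">{genome_name}|{index}|{name}")
        lines.append(cds["translation"])
    return "".join(line + "\n" for line in lines)
```

Reading each `*.gb` file as text, passing it to this function, and concatenating the results would give a single protein FASTA file whose headers carry a gene index. In files with several records (contigs or plasmids) the index runs on across records, so adjacent indices at a record boundary are not true neighbours.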