Dear all,
I would like to understand how to find a particular gene (and its orthologs) in a set of eukaryotic genomes. The simplest way I can see is to divide each chromosome file into LOCUS blocks, read each block line by line, and save the block if a particular gene name is encountered.
But I see several problems. Is it correct that all these orthologs will have the same name?
I am not completely sure.
Another issue: when reading a LOCUS block, I cannot stop after the first appearance of the gene name. I have to keep reading line by line to the end of the block, because a particular name can occur several times per block. If I see it a second time, the block should get a different weight.
The gene name is usually in quotes. That should not matter for a search on Linux, should it?
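use strict;
use warnings;

# input file, output file and gene name are given on the command line
die "Usage: $0 INPUT OUTPUT GENE\n" unless @ARGV == 3;
my ($infile, $outfile, $gene) = @ARGV;
open(my $in,  '<', $infile)  or die "Cannot read $infile: $!";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";

my $block = "";
my $blockisgood = 0;   # number of times the gene name was seen in the block
# reading from input file
while (my $line = <$in>) {
    next if $line =~ /^\s*$/;              # skip empty lines
    if ($line =~ /^LOCUS/) {               # starting a new block
        # but print old block if it was good
        print $out $block if $blockisgood;
        # and reset
        $blockisgood = 0;
        $block = "";
    }
    # check every line of the block (not only the LOCUS line) for the gene;
    # \Q...\E makes special characters in the name match literally
    $blockisgood++ if $line =~ /\Q$gene\E/;
    $block .= $line;
}
# print the last block to output
print $out $block if $blockisgood;
close $in;
close $out;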
This does not work for my gene. Please help me!
Thank you very much!
Natasha
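The weighting idea from the question (a block counts more if the gene name occurs in it several times) can be sketched as follows. This is only an illustration: the function name count_gene_per_locus and the toy input are made up, and it assumes the name appears in double quotes, as in /gene="..." qualifiers. The quotes are ordinary characters in a Perl regex, so they need no special treatment.

```perl
use strict;
use warnings;

# Count how often a quoted gene name occurs in each LOCUS block.
# Returns a hash reference: locus name => number of occurrences (the "weight").
# count_gene_per_locus is a hypothetical helper, not part of any library.
sub count_gene_per_locus {
    my ($text, $gene) = @_;
    my %weight;
    my $locus;
    for my $line (split /\n/, $text) {
        if ($line =~ /^LOCUS\s+(\S+)/) {
            $locus = $1;              # a new block starts here
            $weight{$locus} = 0;
            next;
        }
        next unless defined $locus;   # ignore anything before the first LOCUS
        # \Q...\E makes characters such as "." or "+" in the name match literally;
        # requiring the quotes avoids counting ABC10 when searching for ABC1
        my $count = () = $line =~ /"\Q$gene\E"/g;
        $weight{$locus} += $count;
    }
    return \%weight;
}

# Toy input, made up for illustration only
my $text = <<'END';
LOCUS       chrA 1000 bp
     gene            1..100
                     /gene="ABC1"
     CDS             1..100
                     /gene="ABC1"
LOCUS       chrB 2000 bp
     gene            5..50
                     /gene="XYZ2"
END

my $weight = count_gene_per_locus($text, 'ABC1');
for my $locus (sort keys %$weight) {
    print "$locus\t$weight->{$locus}\n";
}
```

Blocks with a weight of zero can then simply be skipped, and the remaining ones ranked by their count.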
Hi,
Genes do not necessarily have the same name (even though annotators try to keep names consistent as far as possible). It depends on the genomes you use.
If I understand correctly, you are trying to analyse syntenic regions. If the information (i.e. gene names) is not shared between your genomes, you have to verify that the genes are orthologs. The most accurate (but most difficult) way to do that is a phylogenetic approach. Most people prefer to use sequence similarity to define (assume) the relationship between the sequences. It is much easier to set up but a bit less accurate.
If your genomes are well known (for example, present in the Ensembl database), you can also use the relationship annotations it provides between sequences of different species.
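One common similarity-based heuristic is "reciprocal best hits": gene a in genome A and gene b in genome B are taken as candidate orthologs if b is the best-scoring hit of a and a is the best-scoring hit of b. The thread does not name this method, so treat the sketch below as one possible reading of "similarity between the sequences". The function names and the scores are made up; in practice the scores would come from a sequence similarity search.

```perl
use strict;
use warnings;

# For each query gene, keep the target gene with the highest score.
# $hits is a list of [query, target, score] triples (hypothetical layout).
sub best_hits {
    my ($hits) = @_;
    my (%best, %best_score);
    for my $hit (@$hits) {
        my ($query, $target, $score) = @$hit;
        if (!exists $best_score{$query} || $score > $best_score{$query}) {
            $best{$query}       = $target;
            $best_score{$query} = $score;
        }
    }
    return \%best;
}

# Pairs [gene_a, gene_b] where each gene is the best hit of the other.
sub reciprocal_best_hits {
    my ($a_vs_b, $b_vs_a) = @_;
    my $best_ab = best_hits($a_vs_b);
    my $best_ba = best_hits($b_vs_a);
    my @pairs;
    for my $gene_a (sort keys %$best_ab) {
        my $gene_b = $best_ab->{$gene_a};
        if (defined $best_ba->{$gene_b} && $best_ba->{$gene_b} eq $gene_a) {
            push @pairs, [$gene_a, $gene_b];
        }
    }
    return \@pairs;
}

# Made-up scores for illustration only
my @a_vs_b = (['a1', 'b1', 250], ['a1', 'b2', 90], ['a2', 'b2', 120]);
my @b_vs_a = (['b1', 'a1', 245], ['b2', 'a1', 130], ['b2', 'a2', 118]);

for my $pair (@{ reciprocal_best_hits(\@a_vs_b, \@b_vs_a) }) {
    print join("\t", @$pair), "\n";
}
```

Here only a1 and b1 are reported: a2 points to b2, but b2 scores better against a1, so that pair is not reciprocal. Such pairs are still only candidates, and a phylogenetic check remains the more reliable test.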
Yes, you are absolutely right!
I am trying to analyse syntenic regions. Could you please give me some details on how to use a phylogenetic approach, as the most reliable one? What tools exist for doing that?
Will Ensembl help with differing ortholog names? I don't need just the closest left and right neighbours; I would like to see at least a few genes to the left and to the right. How do I do this correctly?
Many thanks!
I can advise you to read this: http://www.ncbi.nlm.nih.gov/pubmed/19740451
I believe I have already seen automated tools for syntenic region analysis/detection presented at several conferences. I think you should spend more time on your literature search.
If your genomes are in the Ensembl database and you program in Perl, it should not be too difficult to write a pipeline that does what you want. For each gene it is possible to get its location and the list of orthologous/paralogous genes.
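Once the location of each gene is known, picking a few genes to the left and right of a gene of interest is a matter of sorting by coordinate. The sketch below assumes the locations have already been collected into a list of hashes with the keys name, chr and start. That layout and the function name flanking_genes are hypothetical, and the sketch does not call the Ensembl API.

```perl
use strict;
use warnings;

# Return up to $k gene names on each side of $name on the same chromosome,
# as two array references (left, right). Returns an empty list if $name
# is not found. flanking_genes is a hypothetical helper for illustration.
sub flanking_genes {
    my ($genes, $name, $k) = @_;
    my ($target) = grep { $_->{name} eq $name } @$genes;
    return unless $target;
    # genes on the same chromosome, ordered by start coordinate
    my @ordered = sort { $a->{start} <=> $b->{start} }
                  grep { $_->{chr} eq $target->{chr} } @$genes;
    my ($idx) = grep { $ordered[$_]{name} eq $name } 0 .. $#ordered;
    # clip the window at the chromosome ends
    my $lo = $idx - $k < 0         ? 0         : $idx - $k;
    my $hi = $idx + $k > $#ordered ? $#ordered : $idx + $k;
    my @left  = map { $_->{name} } @ordered[$lo .. $idx - 1];
    my @right = map { $_->{name} } @ordered[$idx + 1 .. $hi];
    return (\@left, \@right);
}

# Made-up gene locations for illustration only
my @genes = (
    { name => 'g1', chr => '1', start => 100 },
    { name => 'g2', chr => '1', start => 500 },
    { name => 'g3', chr => '1', start => 900 },
    { name => 'g4', chr => '1', start => 1500 },
    { name => 'h1', chr => '2', start => 200 },
);

my ($left, $right) = flanking_genes(\@genes, 'g3', 2);
print "left: @$left\n";
print "right: @$right\n";
```

Running the same lookup around each ortholog in the other genomes gives the neighbourhoods to compare for synteny.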
Thank you, it's a very nice paper! I will ask for their code.