I just started learning the Compara API, and I am still not sure whether it can address my questions. Could someone give me some guidance and example scripts?
Here is my question: (1) I want to identify all DNA fragments in the human genome that are significantly similar to each other (homologous, by LASTZ or BLASTZ). (2) Then, I want to find in which other species two homologous human DNA fragments are both significantly similar (aligned) to a single genomic region in that species.
Alternatively, I can focus on two genomic regions in the human genome, test whether they are homologous, and then find which species have a single genomic region that aligns to both of the human regions.
In particular, I am wondering about multiple hits: in the human self-alignment, one genomic region may map to multiple other regions. Such multiple hits can also occur in, e.g., the mouse genome in the human vs. mouse alignment.
Does Ensembl provide all of these regions or just the best one?
Any scripts that can achieve my goals? My compara API version is 95.
Based on the API documentation, you should be able to retrieve human self-alignments and cross-species alignments for the regions you'd like to query. Compara works on top of aligned sequences, so if you think the alignments are not accurate or exhaustive, you may need to redo the alignments using tools of your choice.
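Once you have the alignment blocks for each of your two human regions (from Compara or from your own LASTZ run), the "which target region is hit by both" step is just an interval intersection. Here is a minimal sketch of that step only. The `Block` type, the `shared_target_regions` function, and the coordinates are made up for illustration and are not part of the Compara API; you would fill the lists from whatever alignment source you use.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """Target-species side of one alignment block (hypothetical container)."""
    target_chrom: str
    target_start: int  # 1-based, inclusive
    target_end: int    # 1-based, inclusive


def shared_target_regions(
    blocks_a: List[Block],
    blocks_b: List[Block],
    min_overlap: int = 1,
) -> List[Tuple[str, int, int]]:
    """Return target intervals covered by a block from each human region."""
    shared = []
    for a in blocks_a:
        for b in blocks_b:
            if a.target_chrom != b.target_chrom:
                continue
            # Intersection of the two target intervals.
            start = max(a.target_start, b.target_start)
            end = min(a.target_end, b.target_end)
            if end - start + 1 >= min_overlap:
                shared.append((a.target_chrom, start, end))
    return shared


# Made-up example: target-species hits for human region A and human region B.
region_a_hits = [Block("chr5", 1000, 2000), Block("chr7", 500, 900)]
region_b_hits = [Block("chr5", 1500, 2500), Block("chr9", 100, 300)]

shared = shared_target_regions(region_a_hits, region_b_hits)
print(shared)
```

Run this once per target species. Because every block is kept in the input lists, multiple hits per human region are handled naturally, provided your alignment source reports them rather than only the best hit.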
Paralogs and orthologs are not defined as repetitive regions within or across genomes; they are defined by evolutionary relationships, i.e. phylogeny. Orthologs are genes that diverged through a speciation event, and paralogs are genes that arose by duplication within a genome. Outside of genes, genomic sequences diverge very quickly after any duplication event, because they are constrained by little selective pressure. As a result, evolutionary relationships for non-genic sequences are very hard to infer, and paralogous relationships outside of genes are very hard to investigate. If you read the Ensembl documentation, Compara is largely about genes: evolutionary relationships (paralogous or orthologous) are built from comparisons between gene trees and species trees. In theory, this approach can be applied to any sequences, as long as we can establish a sequence phylogeny and compare it with the species phylogeny. But I don't think Compara does this.