How can I compare the proteomes of two species for a first, general analysis? In particular, I want to see which proteins (or protein families) in vertebrates DO NOT have a homologous counterpart in Drosophila (with a very high homology cutoff). Which programs should I use to identify and list such proteins? If I use BLAST, which general strategy should I follow? I would be very glad to get some pointers on which programs or tutorials to study.
Thank you very much
good work
Sounds to me like you just want to find (or rather exclude) orthologues of protein-coding genes between a vertebrate species and Drosophila.
You can do this across the whole genome in Ensembl's BioMart - video tutorial here.
Step 1: Choose Ensembl Genes as your database, then choose the dataset for your species of interest - either the vertebrate or Drosophila (fruitfly), depending on whose genes you want listed.
Step 2: Choose filters: click on Filters in the left-hand navigation panel. Expand the GENE section and, under 'Gene type', choose protein_coding. Then expand MULTI SPECIES COMPARISONS, choose Orthologous fruitfly genes (or your vertebrate of interest) and make sure to select the Excluded option below the drop-down menu.
Step 3: Choose the attributes you want: click on Attributes in the left-hand navigation panel. The Ensembl stable ID is chosen by default, but you can choose UniProt IDs or whatever is relevant.
Step 4: Get the results by clicking on the Results button above the left-hand navigation panel.
My query as an example - I was looking for human genes without a fruitfly orthologue (it's most of the human genes...!)
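The same query can also be sent to BioMart programmatically. Below is a minimal Python sketch, not the official client: it builds the XML that BioMart's web service accepts and, in a separate function, submits it. The dataset name `hsapiens_gene_ensembl` and the filter names `biotype` and `with_dmelanogaster_homolog` are my assumptions about the current Ensembl release, so check them against the XML button on the BioMart results page before relying on them. The function names are just illustrative.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Public Ensembl BioMart web service endpoint.
BIOMART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset, homolog_filter, attributes=("ensembl_gene_id",)):
    """Build a BioMart XML query for protein-coding genes WITHOUT an orthologue.

    dataset and homolog_filter are release-dependent names (assumptions here);
    excluded="1" corresponds to the 'Excluded' radio button in the web form.
    """
    query = ET.Element(
        "Query",
        virtualSchemaName="default",
        formatter="TSV",
        header="0",
        uniqueRows="1",
    )
    ds = ET.SubElement(query, "Dataset", name=dataset, interface="default")
    ET.SubElement(ds, "Filter", name="biotype", value="protein_coding")
    ET.SubElement(ds, "Filter", name=homolog_filter, excluded="1")
    for attr in attributes:
        ET.SubElement(ds, "Attribute", name=attr)
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def fetch_rows(xml_query):
    """Send the query to BioMart and return the non-empty result lines."""
    url = BIOMART_URL + "?" + urllib.parse.urlencode({"query": xml_query})
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return [line for line in text.splitlines() if line]
```

To use it, call `build_query("hsapiens_gene_ensembl", "with_dmelanogaster_homolog")` and pass the result to `fetch_rows`, which needs network access. If BioMart answers with an error message instead of gene IDs, the filter or dataset name is most likely different in the release you are querying.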
This is a super vague answer, but in my lab people have used Scaffold and Perseus in the past for comparing enriched peptides between one sample and another.
If the species you are interested in already have annotated genomes available, there is a good chance this analysis has already been done for you: check databases such as OrthoDB and OMA, or search for "ortholog database".
If you want to find orthologs in your own data (transcriptome assembly, protein set derived from genome annotation), you have to predict them. Again there are several choices: OrthoMCL, OMA, ProteinOrtho, and many others. Several of them use BLAST (or another similarity search) as a starting point, but add various methods to filter and refine the groups found.
See this review for an introduction (and source of databases and programs): New Tools in Orthology Analysis: A Brief Review of Promising Perspectives.
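To give an idea of what "BLAST as a starting point" can look like, here is a rough Python sketch that works on BLAST tabular output (`-outfmt 6`, default 12 columns). It lists query proteins with no acceptable hit in the other proteome and, optionally, reciprocal best hits. This is a simple heuristic for a first look, not the method used by OrthoMCL, OMA or ProteinOrtho. The function names and the E-value and identity thresholds are illustrative assumptions, so tune them to your own cutoff.

```python
import csv


def best_hits(blast_lines, max_evalue=1e-5, min_identity=30.0):
    """Map each query ID to its best-scoring subject that passes the thresholds.

    Expects BLAST -outfmt 6 lines: qseqid sseqid pident ... evalue bitscore.
    """
    best = {}
    for row in csv.reader(blast_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if evalue > max_evalue or identity < min_identity:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def proteins_without_homolog(all_query_ids, forward_lines, **thresholds):
    """Return query proteins that have no hit passing the thresholds."""
    hits = best_hits(forward_lines, **thresholds)
    return sorted(set(all_query_ids) - set(hits))


def reciprocal_best_hits(forward_lines, reverse_lines, **thresholds):
    """Return pairs where each protein is the other's best hit."""
    forward = best_hits(forward_lines, **thresholds)
    reverse = best_hits(reverse_lines, **thresholds)
    return {q: s for q, s in forward.items() if reverse.get(s) == q}
```

In practice `forward_lines` would be an open file handle on the vertebrate-versus-Drosophila BLAST results, `reverse_lines` the Drosophila-versus-vertebrate results, and `all_query_ids` the IDs from the vertebrate FASTA file. Proteins with no hit at all never appear in the BLAST output, which is why the full ID list has to be supplied separately.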
Thank you very much for the clear explanation. I tried playing with BioMart. I soon noticed that if you don't set the gene source to Ensembl in the filters, it basically reports that all genes lack orthologues. For example, I tried human versus chimpanzee and it found 60000 human genes with no orthologue. With the gene source filter set to Ensembl, the number was just 26... But when I try to download the file with the results, it always lists around 60000 entries. What am I missing?