Hi,
I am having some difficulty with the approaches suggested and the reality of using said approaches.
Basically I want to look at variations in proteins homologous to a protein I am interested in. I have been told by a hundred different people to "just do it this way" and bingo, the world is perfect. However, the data and the procedure do not reflect that ease of use.
So,
1 - Run blastp with my protein to find homologous proteins, and align them with clustal-o.
2 - Get the gene names for each of the blasted proteins.
3 - Get allele or polymorphism data for each of these genes from a genomic database such as gnomAD.
However, how does my clustal-o alignment of the proteins relate to the alleles in the genomic sequence? And how does an aligned protein variant seen at residue 50 relate to the allele at a particular position in a gene?
- Do I have to map the gene codons to the translated protein residues, so that if there is a polymorphism in one codon I can say that it is associated with the variant seen at the translated amino acid? I.e. how does a polymorphism relate to a protein variant?
I get many homologous proteins for which the databases hold no data at all about the associated gene - so my frequency in this case is N/A, which cannot be used in a numerical analysis.
- What do you do with missing data?
I am at a loss as to how to relate all the data available - and whilst there is a lot of information about the data, there is little that associates one data source with another.
AND I'M NOT LOOKING FOR ANSWERS - I WANT SOME SORT OF DISCUSSION OR EXPLANATION OF THE RATIOCINATION OF THE PROCEDURES INVOLVED. WHAT'S THE REASON FOR ASSOCIATING OR NOT ASSOCIATING PROTEIN VARIANTS WITH GENETIC ALLELES, FOR EXAMPLE? I AM NOT ASKING YOU TO DO THIS FOR ME! - this isn't yelling, just trying to stress my point. I'm not into this instant gratification culture, so I would like a discussion.
Thanks,
U.
Thanks for the reply,
One of the toy problems I am trying to replicate in the real world, with real-world data, is finding positions in genes that have been mutated, and seeing whether these relate to variations seen in the corresponding proteins. Say a protein we get has a variant at AA 100, Serine -> Valine. I want to look at the same position in its gene and determine the frequency of the alleles seen there. I want to see how rare particular polymorphisms are in genes associated with a protein for which we see variations. So I want to find the codon position in my gene that corresponds to the protein variant, and check the frequency of changes seen in the population at this position.
This can then be extended to look at the allele frequency in orthologous genes/proteins for a particular protein variation seen.
Is the approach even correct? Is the biological rationale useful?
The allele frequency and the N/A remark:
Say I get 100 proteins/genes that are homologous to mine. If gnomAD has no data on any of these genes, I cannot get any allele information for them. Therefore from 100 genes I get nothing. I have my current variation etc., but I have no other data. So for this protein the numerical result from the allele analysis will be "weaker" than that from an analysis for which there is a lot of data in gnomAD.
Genomic sequence position -> protein variation position issue:
The codon -> AA check is fine, genetic code etc. What do I use, or what do I look into, for mapping DNA locations to protein sequence locations? When I translate the canonical or allelic DNA sequences, I get a completely different AA chain from the protein sequences I actually have.
Thanks, U.
This is true, but it's not as if you have no information. For example, you could determine whether the intra-species heterogeneity reflects the inter-species heterogeneity in terms of residue conservation. If your species of interest has lots of significant variants in an inter-species conserved region, then maybe these homologous proteins actually have a different function from the one you're looking at.
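To make the conservation comparison a bit more concrete, here is a minimal sketch, assuming you already have the aligned sequences (gaps as "-") as plain strings of equal length. The function names and the three-sequence toy alignment are made up for illustration; they are not from any particular tool. It scores each alignment column by Shannon entropy (0 means fully conserved) and converts a residue number in your protein into its alignment column, so a variant such as "AA 100" can be looked up in the alignment.

```python
import math
from collections import Counter


def column_entropy(alignment, col):
    """Shannon entropy (bits) of one alignment column, ignoring gaps.

    Returns None when the column contains only gaps.
    """
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    if not residues:
        return None
    total = len(residues)
    counts = Counter(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def residue_to_column(aligned_seq, residue_pos):
    """Map a 1-based residue number in the ungapped protein to the
    0-based column it occupies in the gapped alignment row."""
    seen = 0
    for col, ch in enumerate(aligned_seq):
        if ch != "-":
            seen += 1
            if seen == residue_pos:
                return col
    raise ValueError("residue_pos is beyond the end of the sequence")


# Toy alignment (hypothetical data); the first row is the protein of interest.
alignment = ["MK-TA", "MKQTA", "MR-SA"]
col = residue_to_column(alignment[0], 3)  # residue 3 ('T') sits in column 3
print(col, column_entropy(alignment, col))
```

A low-entropy column that carries a common population variant in your species would be the kind of position worth a closer look.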
That depends on the annotations you actually have for your species.
That's definitely something that you need to sort out! Is your reference genome the same? Do you have the correct frame? Are your gene annotations correct? There are a number of reasons this could fail, and having your DNA sequence match your AA sequence is a good sanity check on your data. It might not be in the format you think it is :(
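As a rough sketch of that sanity check, the snippet below translates a DNA sequence in all three frames on both strands and reports where the expected protein shows up. It assumes the standard genetic code and an intron-free coding sequence (a spliced CDS or mRNA, not raw genomic DNA); the helper names are hypothetical and the sequences are toy data.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def translate(dna, frame=0):
    """Translate from the given 0-based frame; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_matching_frame(dna, protein):
    """Return (strand, frame) of the first translation containing the
    protein sequence, or None if no frame matches."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            if protein in translate(seq, frame):
                return strand, frame
    return None


print(find_matching_frame("CATGTCTGTTTAA", "MSV"))
```

If this returns None for a sequence you believe encodes your protein, the usual suspects are introns still present in the DNA, a different transcript isoform, or a different reference assembly.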
So what you are saying in the first case is that if most of the protein's residues are conserved over related species, and we find variations in my protein of interest that occur in this region, one conclusion (on the surface) to be drawn is that the functional performance of this protein may differ from that of the homologous proteins?
Is there a set of tools for particular mappings of DNA locations to protein sequence locations? And can I find out which one suits my annotations? For example no one would advocate blastp for finding homologous proteins - whether you want to use it or not is problem specific.
I sometimes get lost among the trees and cannot see the forest. There are so many data sources and strategies, and some seem logical until you actually implement them.
Thanks, U.
On the surface, yes. You would almost certainly have to test that experimentally, following structural bioinformatics analyses (if they indicate changes that impact structure/conformation).
Are you referring to codon usage? Unless you are looking at distant species or organelle genomes, usage should be similar.
Yes, I understand some level of experimental verification would be needed - or indeed some probabilistic inference on such designations. Thanks for this, I just wanted to see the end goal of some analyses.
No, not codon usage; I can handle that pretty well. I am looking for tools or software that can help map protein positions back to the original positions in the gene. I want this so that I can perform, say, an allele frequency analysis per position/codon.
Thanks, U.
You can use tblastn (or tblastx) for mapping protein sequence back on to DNA. If these are very similar sequences you could even use blat to speed things up significantly.
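Once you have the coding blocks on the genome (from the alignment hits or from a gene annotation), turning a residue number into genomic positions is just arithmetic. Here is a minimal sketch, assuming the coding exons are given as 1-based inclusive (start, end) genomic intervals and that the protein starts at the first coding base; the function names and coordinates are invented for illustration.

```python
def cds_to_genomic(cds_pos, exons, strand="+"):
    """Map a 1-based position in the spliced CDS to a genomic coordinate.

    exons: list of (start, end) genomic intervals, 1-based inclusive,
    covering coding sequence only (no UTRs).
    """
    # Walk exons in transcription order: ascending on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    remaining = cds_pos
    for start, end in ordered:
        length = end - start + 1
        if remaining <= length:
            if strand == "+":
                return start + remaining - 1
            return end - remaining + 1
        remaining -= length
    raise ValueError("cds_pos lies beyond the end of the coding sequence")


def residue_to_genomic(residue_pos, exons, strand="+"):
    """Genomic coordinates of the three codon bases for a 1-based residue."""
    first_base = 3 * (residue_pos - 1) + 1
    return [cds_to_genomic(first_base + i, exons, strand) for i in range(3)]


# Hypothetical two-exon gene; residue 2 has a codon split across the intron.
exons = [(101, 104), (201, 210)]
print(residue_to_genomic(2, exons, "+"))
```

The three coordinates returned for a residue are the positions at which you would then look up allele frequencies. Note that a codon can straddle an intron, which is why the mapping works base by base and not codon by codon.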