Question

Forum: Theory v Reality

0

6.6 years ago

urema • 0

Hi,

I am having some difficulty with the approaches suggested and the reality of using said approaches.

Basically, I want to look at variation in proteins homologous to a protein I am interested in. I have been told by a hundred different people to "just do it this way" and, bingo, the world is perfect. However, the data and the procedure do not reflect that ease of use.

So,

1 - Blastp my protein to homologous proteins, and clustal-o for aligning them.

2 - Get the gene names for each of the blasted proteins.

3 - Get allele data or polymorphism data for each of these genes from a genomic database such as gnomAD.

However, how does my clustal-o alignment of the proteins relate to the genomic sequence alleles? And how does an aligned protein variant seen at residue 50 relate to the allele at a particular position in a gene?

Do I have to map the gene codons to the translated protein residues, so that if there is a polymorphism in one codon I can say it is associated with the variant seen at the translated amino acid? I.e., how does a polymorphism relate to a protein variant?

I get many homologous proteins for which there is no data at all about the associated gene in databases, so my frequency in this case is N/A, which cannot be used in a numerical analysis.

What do you do with missing data?

I am at a loss as to how to relate all the data available. Whilst there is a lot of information about each data source, there is little that associates one data source with another.

AND I'M NOT LOOKING FOR ANSWERS - I WANT SOME SORT OF DISCUSSION OR EXPLANATION OF THE RATIOCINATION OF THE PROCEDURES INVOLVED. WHATS THE REASON FOR ASSOCIATING OR NOT ASSOCIATING PROTEIN VARIANTS WITH GENETIC ALLELES FOR EXAMPLE, I AM NOT ASKING YOU TO DO THIS FOR ME! - this isn't yelling, just trying to stress my point. I'm not into this instant gratification culture, so I would like a discussion.

Thanks,
U.

protein DNA polymorphism variation • 1.6k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 6.6 years ago by urema • 0

score 1 · Answer 1 · 2018-04-26

1

6.6 years ago

d-cameron ★ 2.9k

WHATS THE REASON FOR ASSOCIATING OR NOT ASSOCIATING PROTEIN VARIANTS WITH GENETIC ALLELES FOR EXAMPLE, I AM NOT ASKING YOU TO DO THIS FOR ME!

THERE'S NO NEED TO YELL - WE CAN BE QUITE CIVIL HERE :)

Blastp my protein to homologous proteins, and clustal-o for aligning them.

This will give you a mapping of equivalent residues in all your homologous proteins.
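To make "a mapping of equivalent residues" concrete, here is a minimal sketch of how residue numbers in one protein can be translated to residue numbers in another through the columns of a gapped alignment. The two aligned strings and the names `residue_to_column` and `equivalent_residue` are made up for illustration; in practice the aligned sequences would be read from the clustal-o output.

```python
def residue_to_column(aligned_seq):
    """Entry i is the 0-based alignment column holding residue i+1."""
    return [col for col, ch in enumerate(aligned_seq) if ch != "-"]


def equivalent_residue(aln, src, src_pos, dst):
    """Map 1-based residue src_pos of protein src to the 1-based residue
    of protein dst in the same alignment column (None if dst has a gap)."""
    col = residue_to_column(aln[src])[src_pos - 1]
    if aln[dst][col] == "-":
        return None
    # Residue number in dst = count of non-gap characters up to this column.
    return sum(1 for ch in aln[dst][: col + 1] if ch != "-")


# Toy alignment (hypothetical sequences, same length, '-' marks a gap).
aln = {
    "query":   "MKT-AYIAK",
    "homolog": "MRTLAY-AK",
}

print(equivalent_residue(aln, "query", 4, "homolog"))  # query A4 sits over homolog A5
print(equivalent_residue(aln, "query", 6, "homolog"))  # query I6 sits over a gap
```

The point is that "residue 50" only means the same thing across proteins after it has been converted to an alignment column.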

However how does my clustal-o alignment of the proteins relate to the genomic sequence alleles?

It doesn't directly. You need to convert each genomic sequence allele to a residue change (or no change if synonymous). The clustal-o alignment gives you the mapping of the protein residue change to the set of equivalent residues (and changes) in all the other proteins. I presume you're interested in what's conserved across the family and what is not. Doing the comparison on the clustal-o matched residues allows you to find this out.
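A minimal sketch of the "allele to residue change" step, under several assumptions: the variant is a single-base substitution, its position has already been converted to a 1-based coordinate within the spliced coding sequence (CDS) on the coding strand, and the standard genetic code applies. Databases such as gnomAD report genomic coordinates, so going from those to a CDS position needs the transcript's exon annotation; that part is not shown. The function name `residue_change` and the toy CDS are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def residue_change(cds, cds_pos, alt_base):
    """Return (residue_number, ref_aa, alt_aa) for a substitution at the
    1-based CDS position cds_pos. ref_aa == alt_aa means synonymous."""
    idx = cds_pos - 1
    codon_start = idx - idx % 3          # 0-based start of the affected codon
    ref_codon = cds[codon_start:codon_start + 3].upper()
    offset = idx % 3                     # position of the base within the codon
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    return codon_start // 3 + 1, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


cds = "ATGTCTGGA"                        # toy CDS encoding M-S-G
print(residue_change(cds, 4, "G"))       # TCT -> GCT: residue 2, S -> A
print(residue_change(cds, 6, "C"))       # TCT -> TCC: residue 2, S -> S (synonymous)
```

The residue number that comes out is what you then look up in the alignment to find the equivalent position in the other proteins.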

I get many homologous proteins for which there is no data at all about its associated gene in databases - so my frequency in this case is N/A which cannot be used in a numerical analysis.

Well, you have at least one data point there. You know that there exists a homologous protein, but you're lacking further annotation. I'm not sure how your frequency is NA when you actually have a data point.

what do you do with missing data ?

You are, and will be for a very long time, missing many, many data points. gnomAD is missing variants, relevant species are missing reference genomes, reference genomes are missing. None of your information is exhaustive.
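One way to keep missing values from silently distorting a numerical analysis is to carry them explicitly and report how much of the data was actually observed. The sketch below is only an illustration of that idea: the gene names and frequencies are invented placeholders, and `summarize_frequencies` is a hypothetical helper, not part of any database client.

```python
import math


def summarize_frequencies(freqs):
    """Summarize allele frequencies where None means 'no data available'.
    Missing entries are counted, never treated as a frequency of zero."""
    observed = [f for f in freqs.values() if f is not None]
    return {
        "n_total": len(freqs),
        "n_observed": len(observed),
        "coverage": len(observed) / len(freqs) if freqs else 0.0,
        "mean_af": sum(observed) / len(observed) if observed else math.nan,
    }


# Placeholder values for illustration only, not real gnomAD data.
freqs = {"geneA": 0.01, "geneB": None, "geneC": 0.03, "geneD": None}
print(summarize_frequencies(freqs))
```

Reporting `coverage` next to the summary statistic makes it visible when a result rests on two genes rather than a hundred.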

ADD COMMENT • link 6.6 years ago by d-cameron ★ 2.9k

0

Thanks for the reply,

One of the toy problems I am trying to replicate in the real world, with real-world data, is finding positions in genes that have been mutated and seeing whether these relate to variations seen in the corresponding proteins. Say a protein we get has a variant at AA 100 (Serine -> Valine). I want to look at the same position in its gene and determine the frequency of the alleles seen there. I want to see how rare particular polymorphisms are for genes associated with a protein in which we see variations. So I want to find the codon position in my gene relating to the protein variant, and check the frequency of changes seen in the population at this position.

Then this can be extended to look at the allele frequency for orthologous genes/proteins for a particular protein variation seen.

Is the approach even correct? Is the biological rationale useful?

The allele frequency and the N/A remark: Say I get 100 proteins/genes that are homologous to mine. If gnomAD has no data on any of these genes, I cannot get any allele information for them, so from 100 genes I get nothing. I have my current variation etc., but I have no other data. So for this protein the numerical result from the allele analysis will be "weaker" than that from an analysis for which we have many data in gnomAD.

Genomic sequence position -> protein variation position issue: The codon -> AA check is fine, genetic code etc. What do I use, or what do I look into, for mapping DNA locations to protein sequence locations? When I translate the canonical or allelic DNA sequences, I get a completely different AA chain from the one seen in the protein itself.

Thanks, U.

ADD REPLY • link 6.6 years ago by urema • 0

1

So for this protein the numerical result from the allele analysis will be "weaker" than that from an analysis for which we have many data in gnomAD.

This is true, but it's not as if you have no information. For example, you could determine whether the intra-species heterogeneity reflects the inter-species heterogeneity in terms of residue conservation. If your species of interest has lots of significant variation in an inter-species conserved region, then maybe these homologous proteins actually have a different function from the one you're looking at.
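As a rough sketch of how inter-species conservation can be quantified per alignment column, Shannon entropy is one common choice (it is swapped in here for illustration; the thread does not prescribe a measure). An entropy of 0 means every sequence has the same residue in that column. The toy alignment and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column,
    ignoring gaps. Returns NaN for an all-gap column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return math.nan
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())


def conserved_columns(aligned_seqs, max_entropy=0.0):
    """0-based indices of columns whose entropy is at most max_entropy."""
    return [i for i, col in enumerate(zip(*aligned_seqs))
            if column_entropy(col) <= max_entropy]


# Toy alignment of three homologs (hypothetical sequences).
aligned = ["MKTA", "MRTA", "MKTA"]
print(conserved_columns(aligned))   # columns where all three sequences agree
```

Intra-species variants that land in the columns this returns are the ones worth a second look, since they fall where the family as a whole does not vary.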

What do I use, or what do I look into for mapping DNA locations to protein sequence locations?

That depends on the annotations you actually have for your species.

when I translate the canonical or allelic DNA sequences I get a completely different AA chain than that seen from the protein itself.

That's definitely something that you need to sort out! Is your reference genome the same? Do you have the correct frame? Are your gene annotations correct? There are a number of reasons this could fail, and having your DNA sequence match your AA sequence is a good sanity check on your data. It might not be in the format you think it is :(
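The frame question can be checked mechanically. The sketch below translates a DNA sequence in all six frames and reports which ones contain the expected protein. It assumes the standard genetic code and an intron-free sequence (mRNA or CDS); unspliced genomic DNA containing introns will generally not translate into the protein in any single frame, which is one common cause of the mismatch described. The names `translate` and `matching_frames` and the toy sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def matching_frames(dna, protein):
    """Return (strand, frame) pairs whose translation contains protein."""
    strands = {"+": dna.upper(),
               "-": dna.upper().translate(COMPLEMENT)[::-1]}
    return [(strand, frame)
            for strand, seq in strands.items()
            for frame in range(3)
            if protein in translate(seq[frame:])]


# Toy example: two leading bases shift the reading frame by 2.
print(matching_frames("CCATGTCTGGATAA", "MSG"))
```

An empty result for a sequence you believe encodes the protein points at the annotation, the reference, or splicing rather than at the genetic code.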

ADD REPLY • link 6.6 years ago by d-cameron ★ 2.9k

0

So what you are saying in the first case is that if most of the protein's residues are conserved across related species, and we find variations in my protein of interest that occur in such a conserved region, one conclusion (on the surface) to be drawn is that the function of this protein may differ from that of the homologous proteins?

Is there a set of tools for particular mappings of DNA locations to protein sequence locations? And I can find out which suits depending on my annotations? For example no one would advocate blastp for finding homologous proteins - whether you want to use it or not is problem specific.

I sometimes get lost among the trees and cannot see the forest. There are so many data and strategies that some seem logical until you actually implement them.

Thanks, U.

ADD REPLY • link 6.6 years ago by urema • 0

0

we find variations in my protein of interest that occur in this region, one conclusion (on the surface) to be drawn is that the functional performance of this protein may differ from those homologous proteins?

On the surface yes. You would almost certainly have to test that experimentally following structural bioinformatics analyses (if they indicate changes that impact structure/conformation).

Is there a set of tools for particular mappings of DNA locations to protein sequence locations?

Are you referring to codon usage? Unless you are looking at distant species/organelle genomes, usage should be similar.

ADD REPLY • link 6.6 years ago by GenoMax 147k

0

Yes, I understand some level of experimental verification would be needed - or indeed some probabilistic inference on such designations. Thanks for this, I just wanted to see the end goal of some analyses.

No, not codon usage; I can handle that pretty well. I am looking for tools or software that can aid in mapping protein positions back to the original positions in the gene. I want this so that I can perform, say, an allele frequency analysis per position/codon.

Thanks, U.

ADD REPLY • link 6.6 years ago by urema • 0

0

You can use tblastn (or tblastx, if your query is itself a nucleotide sequence) for mapping protein sequence back on to DNA. If these are very similar sequences you could even use BLAT to speed things up significantly.
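Once tblastn has placed the protein on the DNA, a residue number can be turned into codon coordinates from the hit itself. The sketch below assumes BLAST+ default tabular output (`-outfmt 6`, columns qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), with query coordinates in residues, subject coordinates in nucleotides, and sstart greater than send for minus-strand hits. It only handles gap-free hits, and a hit covers a single exon at best, so spliced genes need one hit per exon. The example line is invented, and `parse_hsp` and `codon_coords` are hypothetical helper names.

```python
def parse_hsp(line):
    """Parse one line of default BLAST tabular output (-outfmt 6)."""
    f = line.rstrip("\n").split("\t")
    return {"qseqid": f[0], "sseqid": f[1], "gapopen": int(f[5]),
            "qstart": int(f[6]), "qend": int(f[7]),
            "sstart": int(f[8]), "send": int(f[9])}


def codon_coords(hsp, residue):
    """Nucleotide (first, last) base of the codon aligned to a 1-based
    protein residue; None if the residue lies outside the hit."""
    if hsp["gapopen"] != 0:
        raise ValueError("sketch only handles gap-free hits")
    if not hsp["qstart"] <= residue <= hsp["qend"]:
        return None
    offset = 3 * (residue - hsp["qstart"])
    if hsp["sstart"] <= hsp["send"]:          # plus strand
        first = hsp["sstart"] + offset
        return (first, first + 2)
    first = hsp["sstart"] - offset            # minus strand: coordinates descend
    return (first, first - 2)


# Invented example hit: 50 residues aligned to bases 1001-1150 of chr1.
line = "myprot\tchr1\t100.00\t50\t0\t0\t1\t50\t1001\t1150\t1e-30\t105"
print(codon_coords(parse_hsp(line), 50))
```

The resulting coordinates are what you would use to look up population variants at that codon.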

ADD REPLY • link 6.6 years ago by GenoMax 147k