Mapping GeneSymbol plus cDNA position to a Chr:Position ... without transcript ID.
1
2
Entering edit mode
2.3 years ago
LauferVA 4.5k

Hey Biostars,

I have related but distinct problems. I'd like to ask if there is an easy way to solve either one.

Problem 1: My collaborators provided me a file with Gene Name, cDNA, and protein change but no unique identifiers (!!), as follows:

gene    alteration  cdna
AKT1    E17K    c.49G>A
AKT1    G162S   c.484G>A
AKT1    G162D   c.485G>A
AKT1    R174C   c.520C>T
AKT1    R174C   c.520C>T
AKT1    A188G   c.563C>G
ALK H976H   c.2928C>T
ALK N986N   c.2958C>T
ALK E994K   c.2980G>A
ALK P999L   c.2996C>T
ALK C1008Y  c.3023G>A
ALK H1063P  c.3188A>C
ALK E1065K  c.3193G>A
ALK L1080R  c.3239T>G
ALK T1090I  c.3269C>T
ALK G1184W  c.3550G>T
ALK V1185V  c.3555G>T
ALK E1197K  c.3589G>A

First, I know this is technically not soluble without using what amounts to inference. Now then, is there any non-awful way to map these to correctly map these variants? For instance, I suppose I could pull the MANE transcript isoform, and see if that corresponds to the alteration in cDNA. Supposing that it does correspond perfectly in every case, I think I could surmise that I have the correct transcript isoform ... I can definitely do this, but I wanted to see if anyone has anyone faced this problem before? What did you do? There are many genes - cannot be done manually.

Problem 2: Suppose I am ultimately able to figure out the problem above (by "figure out" I suppose I mean accurately annotate every variant with an unambiguous label of some kind (rsID, or a chr:pos pair and a build), or what have you. The next thing I would like to do is determine how far each variant is from a given annotation. Specifically, I want to know how far each variant is from the nearest Intron-Exon junction; or the nearest known splice site.

Open to any tools ...

Thank you

annotation splice site transcript build • 1.1k views
ADD COMMENT
1
Entering edit mode

Problem 1: You have to go back to your collaborators and ask for either the unique ids or the raw data. This is an unattainable situation.

Problem 2: Bedtools is probably the answer here.

ADD REPLY
0
Entering edit mode

Im writing a function to map the cDNAs to every known transcript isoform, then determine a fit score for each. if the results make no sense (they will, though), ill go back to them.

For 2, thanks v. much will check dox

VAL

ADD REPLY
1
Entering edit mode
2.3 years ago
LauferVA 4.5k

I was able to answer this question on my own.

So, in this case there is one way to proceed. First, a few caveats.

Caveats: The solution I reached is ultimately probabilistic in nature. What I is, one is not guaranteed to be able to uniquely map the variants back to the correct transcript isoform (meaning, the original one that whoever did the mapping mapped it to; the one that was used to generate HGVS IDs). Having said that, based on the results I obtained, if you have more than 10 variants you are likely to arrive at a unique solution, and if you have more than 15 the probability of not being able to uniquely map decays for almost every gene.

Step 1: Start by downloading every transcript annotation for the build you are on. If you dont know the build you are on, repeat all steps for all steps. I recommended downloading at minimum all Ensembl transcripts, all refSeq transcripts, all knownGenetranscripts, and all refGene transcripts.

Step 2: Either learn how to programmatically access UCSC, Ensembl, NCBI, or a like database, or write your own scripts.

Step 3: Using the transcripts you have downloaded, make a hashtable-like object in whatever language you are using that stores Gene as the Key, and every transcript from all 4 databases above as the values.

Step 4: Write a function that will loop over all the variants in your data by gene, then stop.

Step 5: Loop over every gene in the genome, pulling the variants for each gene each time by calling the function above.

Step 6: Write another function that stores the number of variants that correctly correspond to the actual sequence in your data per transcript. For instance, if your data contains EGFR c.389T>A c.489C>A c.589G>A c.689T>C c.789A>T c.889T>C, you count the number of variants for which your data match the WT sequence in the transcript.

Step 7: Let Step 6 run until you obtain all the match counts for all transcripts for each gene. Now, write another function to compare transcript match scores for all the transcripts matching a gene, and pull only the transcripts that are a perfect match to your data (correct matches = 62, variants in data = 62).

Step 8: If there is only one transcript that is a perfect match, and nothing else is close, this is likely correct. Pull this and put it in a "high-confidence bucket" for every gene.

Step 9: Now, look across the genes at the winning transcript for each gene. Does a trend emerge? In other words, do you find that all the transcripts that have perfect match sequences come from NCBI RefSeq? If so, do alll of them correspond to a version that was current at the same time?

For me, the answer to those questions was yes. There was a clear trend that showed me what had been done. If I were to have generated ambiguous results, I am not sure what I would have done. But this produced unambiguous results for me, so I was done.

VL

ADD COMMENT

Login before adding your answer.

Traffic: 1642 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6