Question

Mapping GeneSymbol plus cDNA position to a Chr:Position ... without transcript ID.

2

Entering edit mode

2.3 years ago

LauferVA 4.5k

Hey Biostars,

I have related but distinct problems. I'd like to ask if there is an easy way to solve either one.

Problem 1: My collaborators provided me a file with Gene Name, cDNA, and protein change but no unique identifiers (!!), as follows:

gene    alteration  cdna
AKT1    E17K    c.49G>A
AKT1    G162S   c.484G>A
AKT1    G162D   c.485G>A
AKT1    R174C   c.520C>T
AKT1    R174C   c.520C>T
AKT1    A188G   c.563C>G
ALK H976H   c.2928C>T
ALK N986N   c.2958C>T
ALK E994K   c.2980G>A
ALK P999L   c.2996C>T
ALK C1008Y  c.3023G>A
ALK H1063P  c.3188A>C
ALK E1065K  c.3193G>A
ALK L1080R  c.3239T>G
ALK T1090I  c.3269C>T
ALK G1184W  c.3550G>T
ALK V1185V  c.3555G>T
ALK E1197K  c.3589G>A

First, I know this is technically not soluble without using what amounts to inference. Now then, is there any non-awful way to map these to correctly map these variants? For instance, I suppose I could pull the MANE transcript isoform, and see if that corresponds to the alteration in cDNA. Supposing that it does correspond perfectly in every case, I think I could surmise that I have the correct transcript isoform ... I can definitely do this, but I wanted to see if anyone has anyone faced this problem before? What did you do? There are many genes - cannot be done manually.

Problem 2: Suppose I am ultimately able to figure out the problem above (by "figure out" I suppose I mean accurately annotate every variant with an unambiguous label of some kind (rsID, or a chr:pos pair and a build), or what have you. The next thing I would like to do is determine how far each variant is from a given annotation. Specifically, I want to know how far each variant is from the nearest Intron-Exon junction; or the nearest known splice site.

Open to any tools ...

Thank you

annotation splice site transcript build • 1.1k views

ADD COMMENT • link 2.3 years ago by LauferVA 4.5k

1

Entering edit mode

Problem 1: You have to go back to your collaborators and ask for either the unique ids or the raw data. This is an unattainable situation.

Problem 2: Bedtools is probably the answer here.

ADD REPLY • link 2.3 years ago by Mark ★ 1.6k

0

Entering edit mode

Im writing a function to map the cDNAs to every known transcript isoform, then determine a fit score for each. if the results make no sense (they will, though), ill go back to them.

For 2, thanks v. much will check dox

VAL

ADD REPLY • link 2.3 years ago by LauferVA 4.5k

score 1 · Accepted Answer · 2022-08-20

I was able to answer this question on my own.

So, in this case there is one way to proceed. First, a few caveats.

Caveats: The solution I reached is ultimately probabilistic in nature. What I is, one is not guaranteed to be able to uniquely map the variants back to the correct transcript isoform (meaning, the original one that whoever did the mapping mapped it to; the one that was used to generate HGVS IDs). Having said that, based on the results I obtained, if you have more than 10 variants you are likely to arrive at a unique solution, and if you have more than 15 the probability of not being able to uniquely map decays for almost every gene.

Step 1: Start by downloading every transcript annotation for the build you are on. If you dont know the build you are on, repeat all steps for all steps. I recommended downloading at minimum all Ensembl transcripts, all refSeq transcripts, all knownGenetranscripts, and all refGene transcripts.

Step 2: Either learn how to programmatically access UCSC, Ensembl, NCBI, or a like database, or write your own scripts.

Step 3: Using the transcripts you have downloaded, make a hashtable-like object in whatever language you are using that stores Gene as the Key, and every transcript from all 4 databases above as the values.

Step 4: Write a function that will loop over all the variants in your data by gene, then stop.

Step 5: Loop over every gene in the genome, pulling the variants for each gene each time by calling the function above.

Step 6: Write another function that stores the number of variants that correctly correspond to the actual sequence in your data per transcript. For instance, if your data contains EGFR c.389T>A c.489C>A c.589G>A c.689T>C c.789A>T c.889T>C, you count the number of variants for which your data match the WT sequence in the transcript.

Step 7: Let Step 6 run until you obtain all the match counts for all transcripts for each gene. Now, write another function to compare transcript match scores for all the transcripts matching a gene, and pull only the transcripts that are a perfect match to your data (correct matches = 62, variants in data = 62).

Step 8: If there is only one transcript that is a perfect match, and nothing else is close, this is likely correct. Pull this and put it in a "high-confidence bucket" for every gene.

Step 9: Now, look across the genes at the winning transcript for each gene. Does a trend emerge? In other words, do you find that all the transcripts that have perfect match sequences come from NCBI RefSeq? If so, do alll of them correspond to a version that was current at the same time?

For me, the answer to those questions was yes. There was a clear trend that showed me what had been done. If I were to have generated ambiguous results, I am not sure what I would have done. But this produced unambiguous results for me, so I was done.

VL