Question

How to link UCSC peptide to transcripts


10.4 years ago

jacobsen.jeremy ▴ 40

I am attempting to insert observed variant modifications from Annovar into protein sequences that I retrieved from the UCSC file knownGeneTxPep. The variant positions, counted from the start of each transcript, come from Annovar. Here is my question:

When I map peptide IDs (say uc010nwy.3) to transcript IDs (say NM_0010757) using "kgXref", the mapping is not 1:1. There are more peptide IDs than transcript IDs, meaning multiple peptide IDs map to a single transcript ID. This confounds what I am trying to do, because I don't know which peptide sequence to alter when Annovar says that a variant was observed in transcript X.
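To see how widespread the non-1:1 mapping is, the kgXref rows can be grouped by RefSeq accession. The following is only a minimal sketch: it assumes a tab-separated kgXref dump with the UCSC ID in the first column and the RefSeq accession in the sixth (check this against the table schema for your assembly), and the function names are hypothetical.

```python
import csv
from collections import defaultdict


def group_kg_ids_by_refseq(handle, kg_col=0, refseq_col=5):
    """Map each RefSeq accession to the set of UCSC IDs that list it.

    The default column indexes are an assumption about the kgXref layout.
    """
    by_refseq = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(kg_col, refseq_col):
            continue  # skip short or malformed rows
        kg_id, refseq = row[kg_col], row[refseq_col]
        if refseq:  # the RefSeq column can be empty
            by_refseq[refseq].add(kg_id)
    return dict(by_refseq)


def ambiguous_refseqs(by_refseq):
    """Keep only the RefSeq accessions linked to more than one UCSC ID."""
    return {acc: sorted(ids) for acc, ids in by_refseq.items() if len(ids) > 1}
```

Calling `ambiguous_refseqs(group_kg_ids_by_refseq(open("kgXref.txt")))` would then list the transcripts for which the choice of peptide sequence is not unique.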

I'm not certain I'm using the correct files for the task and I've been unable to find any documentation. Any help would be great.

Thanks,
Jeremy

UCSC RNA-Seq SNP • 2.6k views

updated 2.0 years ago by Ram 45k • written 10.4 years ago by jacobsen.jeremy ▴ 40


This seems unnecessarily complicated to me. Shouldn't Annovar tell you what protein change your variant causes? What exactly is the information you have, and what is the information you want?

updated 3.9 years ago by Ram 45k • written 10.4 years ago by Bert Overduin ★ 3.7k


I've narrowed things down a little. The problem seems to be with the Annovar entries that have more than one transcript associated with a variant. For instance, this entry seems to be correct:

line64679    nonsynonymous SNV    YTHDC2:NM_022828:exon26:c.C3757G:p.L1253V,    5    112920108    112920108    C    G

By "correct" I mean that when I use kgXref to get the UniProt ID that corresponds to NM_022828, the protein sequence has an L at position 1253.

On the other hand, when there is more than one RefSeq ID in the Annovar output (the variant affects multiple transcripts), things go wrong. For instance:

line64929    nonsynonymous SNV    HSD17B4:NM_001199291:exon7:c.T392A:p.L131Q,HSD17B4:NM_000414:exon6:c.G317A:p.R106H,HSD17B4:NM_001199292:exon5:c.G263A:p.R88H,    5    118811533    118811533    G    A

Now when I use kgXref to pull out the protein sequence associated with NM_001199291, there is no L at position 131, but rather a T.
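The manual check described here (does the retrieved peptide carry the expected reference residue?) can be automated before any sequence is altered. This is a rough sketch that assumes the annotation field has the `gene:transcript:exon:c.change:p.change` layout seen in the two example lines and only handles single-residue substitutions; the function names are hypothetical.

```python
import re

# Matches simple substitutions such as p.L131Q (reference, 1-based position, alternate).
AA_CHANGE = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_aa_changes(field):
    """Split a comma-separated annotation field into one record per transcript."""
    changes = []
    for entry in field.split(","):
        parts = entry.strip().split(":")
        if len(parts) != 5:
            continue  # skips the empty piece left by the trailing comma
        gene, transcript, _exon, _cdna, protein = parts
        match = AA_CHANGE.match(protein)
        if match is None:
            continue  # not a simple substitution; ignored in this sketch
        changes.append({
            "gene": gene,
            "transcript": transcript,
            "ref": match.group(1),
            "pos": int(match.group(2)),
            "alt": match.group(3),
        })
    return changes


def reference_matches(peptide, change):
    """True if the peptide has the annotated reference residue at that position."""
    pos = change["pos"]
    return 1 <= pos <= len(peptide) and peptide[pos - 1] == change["ref"]


def apply_change(peptide, change):
    """Return the altered peptide, refusing to edit when the reference residue differs."""
    if not reference_matches(peptide, change):
        raise ValueError(
            f"{change['transcript']}: expected {change['ref']} at position {change['pos']}"
        )
    i = change["pos"] - 1
    return peptide[:i] + change["alt"] + peptide[i + 1:]
```

Running `reference_matches` for every transcript in a multi-transcript entry would flag cases like NM_001199291 above, where the retrieved sequence and the annotation disagree, instead of silently altering the wrong residue.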

ADD REPLY • link updated 3.9 years ago by Ram 45k • written 10.4 years ago by jacobsen.jeremy ▴ 40