Question

How to find amino acid residues whose codon is split across two adjacent exons

9.5 years ago

Les Ander ▴ 110

I am trying to identify, for each protein in the human genome, all the amino acid residues whose codons are split between two adjacent exons.

To illustrate the problem, suppose I have the following peptide PNKCSGMRFP

Suppose that the residues PNKCS are encoded by exon 2, but the codon encoding the amino acid "G" (here the GGA codon) is split either as G|GA (the G on the left side of "|" is encoded by exon 2 and the GA on the right side of "|" is encoded by exon 3) or as GG|A. I want to find every residue like this "G".

Thank you

Lee

exon • 3.1k views

updated 2.9 years ago by Ram • written 9.5 years ago by Les Ander

Ram · Answer 1 · 2016-01-14

You could use the UCSC knownGene table and awk to get the positions of the exons. I put an awk script here: https://github.com/lindenb/awk-sandbox/blob/master/src/bio/ucsc/biostar172743.awk

$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | awk -f biostar172743.awk  | head

#NAME    CHROM    POS-0    STRAND    EXON    CODON-0    CDNA-0    PROT-0
uc010nxq.1    chr1    12226    +    Exon_1    2    38    12
uc021oeh.1    chr1    324685    +    Exon_2    1    172    57
uc021oeh.1    chr1    325123    +    Exon_3    2    578    192
uc001abv.1    chr1    865715    +    Exon_3    2    254    84
uc001abv.1    chr1    866468    +    Exon_4    2    305    101
uc001abw.1    chr1    865715    +    Exon_3    2    254    84
uc001abw.1    chr1    866468    +    Exon_4    2    305    101
uc001abw.1    chr1    871275    +    Exon_5    1    430    143
uc001abw.1    chr1    874508    +    Exon_6    1    520    173

Note: I haven't checked the output of this awk script much. Please spot-check a few positions before relying on it.
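One quick way to spot-check the table: in the rows shown, the last three columns look related by CODON-0 = CDNA-0 mod 3 and PROT-0 = CDNA-0 div 3. That relationship is inferred from the printed rows only, not from the script's source, and the reading of "-0" as "0-based" is an assumption. A minimal Python sketch of the check (the helper name `codon_and_residue` is made up for illustration):

```python
# Rows copied from the output above as (CODON-0, CDNA-0, PROT-0).
rows = [
    (2, 38, 12),
    (1, 172, 57),
    (2, 578, 192),
    (2, 254, 84),
    (2, 305, 101),
    (1, 430, 143),
    (1, 520, 173),
]


def codon_and_residue(cdna0):
    """Split a coding-sequence position (assumed 0-based) into
    (offset within the codon, residue index)."""
    residue, offset = divmod(cdna0, 3)
    return offset, residue


# Every printed row should satisfy the inferred relationship.
for codon0, cdna0, prot0 in rows:
    assert codon_and_residue(cdna0) == (codon0, prot0)
```

An offset of 1 or 2 means the position falls inside a codon rather than at its first base, which is why only those two values show up in the CODON-0 column.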

score 0 · Answer 2 · 2016-01-14

You'll need to pick your source for transcripts, exons, etc., because the definitions differ somewhat between sources. I usually advocate Ensembl or GENCODE as the most comprehensive, although some of their exons and transcripts may be non-canonical, and how biologically meaningful some minor or alternative exons/transcripts are is up for debate. Still, I'd say those are the best sources for these definitions.

A BED or GTF file with exon definitions is easy to get (GENCODE's files are here). You can use those to get the start and stop coordinates of exons. From there, as a rough first pass, it would be easy to find all of the exons whose length isn't a multiple of three. Strictly, it is the coding portion of each exon that matters, so a more refined approach would map codon positions to genomic coordinates.
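A minimal Python sketch of the more refined idea, assuming you have already pulled the coding (CDS) length of each exon for one transcript, in transcription order, from your annotation of choice. The function name `split_codons` and its input format are hypothetical, not part of any library. A codon straddles a junction exactly when the cumulative coding length up to that junction is not a multiple of three, which is a stricter test than looking at each exon's length alone.

```python
from itertools import accumulate


def split_codons(cds_lengths):
    """Return (junction, residue_index, left_bases, right_bases) per split codon.

    cds_lengths: number of coding bases contributed by each exon, in
    transcription order (hypothetical input, e.g. taken from the CDS
    features of a single transcript). Residue indices are 0-based.
    """
    hits = []
    # Cumulative coding length at the end of every exon except the last one.
    for junction, total in enumerate(accumulate(cds_lengths[:-1]), start=1):
        left = total % 3  # bases of the current codon upstream of the junction
        if left:  # 1 or 2 means the codon straddles this junction
            hits.append((junction, total // 3, left, 3 - left))
    return hits


peptide = "PNKCSGMRFP"
# PNKCS (15 bases) plus the first G of GGA upstream, the other 14 bases downstream.
for junction, residue, left, right in split_codons([16, 14]):
    print(junction, peptide[residue], f"{left}|{right}")  # prints: 1 G 1|2
```

With lengths `[16, 5, 9]` only the first junction is reported: the second exon's length (5) is not a multiple of three, but the cumulative length at its end (21) is, so no codon is split there. The sketch does not handle the rare case of a coding exon so short that one codon spans three exons.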