I have a genomic region of interest and I am trying to retrieve the gene (if its in one) and amino acid sequence (if its in a translated region). Suppose my region of interest is:
chromosome = "chr19" start = 55014736 stop = 55014753
If I put this into USCS genome browser by hand I see it's in gene GP6 and the translation is LCVSLC. I can use cruzdb and retrieve the correct gene but the translation of cds_sequence does not contain the LCVSLC motif. The program I tried is posted below and you can test that the following command produces no output:
python myCruzdbProgram.py | awk '/LCVSLC/'
How do I simply get the translation programatically for my region of interest?
import cruzdb
from cruzdb import Genome
def translate_dna(sequence):
gencode = { 'ATA':'I',
'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T',
'ACT':'T', 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S',
'AGT':'S', 'AGA':'R', 'AGG':'R', 'CTA':'L', 'CTC':'L', 'CTG':'L',
'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 'CAC':'H',
'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R',
'CGT':'R', 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A',
'GCC':'A', 'GCG':'A', 'GCT':'A', 'GAC':'D', 'GAT':'D', 'GAA':'E',
'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 'TCA':'S',
'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L',
'TTG':'L', 'TAC':'Y', 'TAT':'Y', 'TAA':'\_', 'TAG':'\_', 'TGC':'C',
'TGT':'C', 'TGA':'_', 'TGG':'W', }
proteinseq = ''
for n in range(0,len(sequence),3):
if gencode.has_key(sequence[n:n+3]) == True:
proteinseq += gencode[sequence[n:n+3]]
return proteinseq
g = cruzdb.Genome('hg38')
# this is my region of interest
chromosome = "chr19"
start = 55014736
stop = 55014753
# retrieve annotation of the region with cruzdb
gene = g.bin_query('refGene',chromosome,start,stop).all()
# print the gene name ( it does find the correvt gene )
print gene[0].name2
# get the translated sequence
sequence = "".join(gene[0].cds_sequence)
print sequence
# translation of my 18 base pair region should be LCVSLC but it is not
print translate_dna(sequence.upper())
thank you! I have done both; but, I have a followup question. If we take another look at the region I used as an example:
and compare to the entry in the refGene.txt file from UCSC we see that this region is between the transcription start and end coordinates but not between the cds start and end coordinates.
I conclude that my region is transcribed but not translated. However, If I look up my region on the UCSC genome browser:
I see a translation of my region under "multiz alignments of 100 vertebrates". I suppose the multiz alignment does not necessarily mean the region is translated? Should I stick to the CDS coordinates in the refGene.txt file?
For that particular transcript, you are in the UTR: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg38_utrGp6
The position you are interested in is the blue highlight. The reason you see translation in the Multiz 100 way track is because there is transcript that has a coding region there.
Also note that the codon translation you see in the 100-way alignment is according to the gene tracks as described in the "Base Level" section of the description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr19&g=cons100way#TRACK_HTML
And so you will see different codon translations where there are differences between refGene and the knownGene tables.
If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:
ChrisL from the UCSC Genome Browser