How to convert Human Ensembl transcript coordinates to GRCh38
1
0
Entering edit mode
5.2 years ago
Alexandra • 0

Hello, I would like to convert Human Ensembl transcript coordinates to GRCh38. I tried to use R package "ensembldb 2.13.1", but it uses the old version of database (EnsDb.Hsapiens.v86) and is not suitable for my data from Ensemble release 100. Сould advise me some tool for this task?

R python Ensembl transcript conversion • 4.2k views
ADD COMMENT
0
Entering edit mode

Ensembl transcripts should be using the latest genome build. Can you provide examples of what you have that you want to convert?

ADD REPLY
0
Entering edit mode

Now I have a csv-file like this:

"ENST","Gene_name","Position"
"ENST00000000233.10","ARF5",133
"ENST00000000233.10","ARF5",145
"ENST00000000233.10","ARF5",82
"ENST00000000412.8","M6PR",153
"ENST00000000442.11","ESRRA",175

and I want to convert this to genome coordinates

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Could you please briefly explain me how to use BioMart for convertation? I previously used it just to export data.

ADD REPLY
0
Entering edit mode

If you have used BioMart before just cut the first column of your ID's and use that to restrict your search.

ADD REPLY
0
Entering edit mode

Sorry if my questions seem strange, I'm new to bioinformatics. I need to convert exact position in transcript (e.g. position 133 in ENST00000000233.10) to genome coordinate . I do not need the genomic coordinates of the entire transcript.

ADD REPLY
0
Entering edit mode

That you may need to do yourself and may require writing some custom code. Methods mentioned here will give you the genomics co-ordinates of the entire transcript.

ADD REPLY
0
Entering edit mode

Thenk you for your help!

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I have python code to do this if it is of any help.

ADD REPLY
0
Entering edit mode

Put it in a GitHub gist and link as a answer.

ADD REPLY
0
Entering edit mode

I've made my own code already, but it would be great to see the code of a more experienced user, please share it.

ADD REPLY
0
Entering edit mode
5.2 years ago

Sorry, thought I had posted this yesterday...

The gist is here:

'''
Copyright 2018 Ian Sudbery
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy,
modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
'''
from cgat import GTF
import numpy
import pandas
##################################################
class TranscriptCoordInterconverter:
'''Interconvert between genome domain and transcript domain
coordinates.
As there are expected to be many calls against the same transcript,
time can be saved by precomputation. Overlapping exons are merged.
Parameters
----------
transcript : sequence of `CGAT.GTF.Entry`
Set of GTF entires representing a transcript.
introns : bool, optional
Use introns instead of exons (see below).
Attributes
----------
transcript_id : str
Value of the transcript_id field of the transcript in quesiton.
Taken from the transcript_id field of the first entry in
`transcript`.
strand : str
Strand of the transcript. Taken from the strand field of the first
entry in `transcript`.
offset : int
Position of the start of the transcript in genome coordinates
genome_intervals : list of tuples of int
Coordinates of exons (or introns) in genome space as the difference
from `offset`. These are sorted in transcript order (see below)
transcript_intervals : list of tuples of int
Coordinates of exons (or introns) in transcript space. That is
absolute distance from transcription start site after splicing.
length : int
Total length of intervals (exons or introns) in the transcript
Notes
-----
Imagine the following transcript::
chr1 protein_coding exon 100 108 . - . transcript_id "t1"; gene_id "g1";
chr1 protein_coding exon 112 119 . - . transcript_id "t1"; gene_id "g1";
chr1 protein_coding exon 100 108 . - . transcript_id "t1"; gene_id "g1";
We can visualise the relationship between the different coordinate
domains as below::
Genome coordinates: 1 1 1 1
0 1 2 3
0123456789012345678901234567890
Transcript: |<<<<<<|----|<<<<<|-----|<<<<<|
Transcript Coords: 2 1 0
10987654 3210987 6543210
with `introns=True`: 8765 43210
Thus the intervals representing the exons in the transcript domain are
(0, 7), (7,14), (14, 22), and the genome base 115 corresponds to
transcript base 10.
TranscriptCoordInterconverter.genome2transcript should be the
interverse of TranscriptCoordInterconverter.transcript2genome.
That is if::
myConverter = TranscriptCoordInterverter(transcript)
then::
myConverter.genome2transcript(myConverter.transcript2genome(x)) == x
and::
myConverter.transcript2genome(myConverter.genome2transcript(x)) == x
'''
def __init__(self, transcript, introns=False):
''' Pre compute the conversions for each exon '''
if not introns:
intervals = GTF.asRanges(transcript, feature="exon")
else:
intervals = GTF.toIntronIntervals(transcript)
# get strand
self.strand = transcript[0].strand
# store transcript_id
try:
self.transcript_id = transcript[0].transcript_id
except AttributeError:
self.transcript_id = transcript[0].gene_id
# sort the exons into "transcript" order
if self.strand == "-":
intervals.sort(reverse=True)
intervals = [(y-1, x-1) for x, y in intervals]
else:
intervals.sort(reverse=False)
self.offset = intervals[0][0]
self.genome_intervals = [map(abs, (x-self.offset, y-self.offset))
for x, y in intervals]
interval_sizes = [abs(y-x) for x, y in intervals]
total = 0
transcript_intervals = [None]*len(interval_sizes)
for i in range(len(interval_sizes)):
transcript_intervals[i] = (total,
interval_sizes[i] + total)
total += interval_sizes[i]
self.transcript_intervals = transcript_intervals
self.length = transcript_intervals[-1][1]
def genome2transcript(self, pos):
'''Convert genome coordinate into transcript coordinates.
Can be a single coordinate or a array-like of coordinates
Parameters
----------
pos : int or array-like or int
positions, in the genome domain, to be converted
Returns
-------
int or numpy.array
The position, or positions converted into the transcript
domain.
Raises
------
ValueError
If the supplied genome position is not in the transcript.
This could be because it falls into one of the introns
(or exons if the converter was created with ``introns=True``,
or because the requested coordinates are before the start or
after the end of the transcript.
See Also
--------
transcript2genome : The inverse of this operation
genome_interval2transcript : Convert intervals rather than single
positions
Notes
-----
A key point to be aware of is that this function converts positions
not intervals. Because of the zero-based half-open nature of python
coordinates, you can't just convert the start and end positions
into the transcript space. Use :method:`genome_interval2transcript`
for this.
Passing an array ensures that the transcript is only
searched once, ensuring O(n) performance rather than
O(nlogn)'''
if len(pos) == 0:
return np.array([])
try:
relative_pos = pos - self.offset
except TypeError:
relative_pos = np.array(pos) - self.offset
if self.strand == "-":
relative_pos = relative_pos * -1
ordering = np.argsort(relative_pos)
relative_pos = np.sort(relative_pos)
# pre allocate results list for speed
try:
results = np.zeros(len(relative_pos))
except TypeError:
relative_pos = np.array([relative_pos])
results = np.zeros(1)
i = 0
i_max = len(relative_pos)
# do linear search for correct exon
for exon, interval in enumerate(self.genome_intervals):
if relative_pos[i] < interval[0]:
raise ValueError("Position %i is not in transcript %s" %
(pos[i], self.transcript_id))
while relative_pos[i] < interval[1]:
pos_within_exon = relative_pos[i]-interval[0]
transcript_exon = self.transcript_intervals[exon]
transcript_position = transcript_exon[0] + pos_within_exon
results[i] = transcript_position
i += 1
if i == i_max:
return results[ordering]
# exon has not been found
raise ValueError("Position %i (%i relative) is not in transcript %s\n exons are %s" %
(pos[i], relative_pos[i], self.transcript_id, self.genome_intervals))
def transcript2genome(self, pos):
'''Convert transcript coordinates into genome coordinates.
Can be a single coordinate or a array-like of coordinates
Parameters
----------
pos : int or array-like or int
positions, in the transcript domain, to be converted
Returns
-------
int or numpy.array
The position, or positions converted into the genome
domain.
Raises
------
ValueError
If the supplied genome position is not in the transcript.
This would generatlly be because the supplied genome is
either negative or greater than the length of the transcript.
See Also
--------
genome2transcript : The inverse of this operation
transcript_interval2genome_intervals : Convert intervals rather than
single positions
Notes
-----
A key point to be aware of is that this function converts positions
not intervals. Because of the zero-based half-open nature of python
coordinates, you can't just convert the start and end positions
into the transcript space. Use
:method:`transcript_interval2genome_intervals` for this.
Passing an array ensures that the transcript is only
searched once, ensuring O(n) performance rather than
O(nlogn)'''
try:
if len(pos) == 0:
return np.array([])
except TypeError:
pos = np.array([pos])
# Converting a list is only efficient if the list is ordered
# however want to be able to return list in the same order it
# arrived, so remember the order and then sort.
ordering = np.argsort(pos)
pos = np.sort(pos)
# pre allocate results list for speed
results = np.zeros(len(pos))
i = 0
i_max = len(pos)
# do linear search for correct exon
for exon, interval in enumerate(self.transcript_intervals):
while pos[i] < interval[1]:
pos_within_exon = pos[i] - interval[0]
genome_exon = self.genome_intervals[exon]
relative_genome_position = genome_exon[0] + pos_within_exon
if self.strand == "-":
results[i] = (self.offset - relative_genome_position)
i += 1
else:
results[i] = (self.offset + relative_genome_position)
i += 1
if i == i_max:
return results[ordering]
# beyond the end of the transcript
ValueError("Transcript postion %i outside of transcript %s" %
(pos[i], self.transcript_id))
def transcript_interval2genome_intervals(self, interval):
'''Convert an interval in transcript coordinates and convert to
a list of intervals in genome coordinates.
Returns a list because a single interval might cross several
exons, and the returned list of intervals would exclude the
intronic sequence.
Parameters
----------
interval : tuple of int
Tuple with zero-based half open interval of form (start, end)
in the transcript domain.
Returns
-------
list of tuples
List of zero-based, half open (start, end) tuples that encompas
the same bases as described by `interval` but in the genome
domain.
Raises
------
ValueError
If the supplied strart or end is not in the transcript.
See Also
--------
transcript2genome : Does the actaul conversion
genome_interval2transcript : Almost the inverse of this
Notes
-----
Because this method returns possibly discontinous intervals
that will always only contain exonic regions (or only intronic if
the converter was created with ``introns=True``), it is not quite
guarenteed to be the inverse of genome_interval2transcript. This
is because a single interval that spans both introns and exons
can be passed to that method (as long as both start and end are
exonic), and will be converted to a single interval that covers
the exon bases, but converting the same interval back would give
to intervals covering only the exonic parts.
'''
outlist = []
for exon in self.transcript_intervals:
if interval[0] < exon[1]:
start = interval[0]
if interval[1] <= exon[1]:
outlist.append((start, interval[1]))
break
else:
outlist.append((start, exon[1]))
interval = (exon[1], interval[1])
genome_list = [tuple(self.transcript2genome((x, y-1))) for
x, y in outlist]
# these intervals are zero based-closed. Need to make half open
if self.strand == "+":
genome_list = [(x, y+1) for x, y in genome_list]
else:
genome_list = [(y, x+1) for x, y in genome_list]
return sorted(genome_list)
def genome_interval2transcript(self, interval):
'''Convert an interval in genomic coordinates into an interval
in transcript-coordinates.
Parameters
----------
interval : tuple of int
Tuple with zero-based half open interval of form
(``start``, ``end``) in the genome domain. ``start`` < ``end``
Returns
-------
tuple of int
Half open (start, end) tuple starts and ends at the same bases
as the interval described in `interval`, but in transcript
domain coordinates.
Raises
------
ValueError
If the supplied strart or end is not in the transcript.
See Also
--------
genome2transcript : Does the actaul conversion
genome_interval2transcript : Almost the inverse of this
Notes
-----
This is not quite the inverse of
:method:`transcript_interval2genome_intervals`, because of how
intervals that split across introns are handled. See
:method:`transcript_interval2genome_intervals` for details of the
difference.
.. warning::
This method current has a known issue where if the interval
includes the last base of the transcripts (one the +ve strand)
or the first base (on the -ve strand), an error will occur. I
aim to fix this before release.
'''
transcript_list = self.genome2transcript(interval)
if self.strand == "+":
return sorted(transcript_list)
else:
# half open
transcript_list = transcript_list[0] + 1, transcript_list[1] + 1
return sorted(transcript_list)

It depends oncgat as well as pandas and numpy. You use it like so:

 from cgat import GTF
 coord_tobe_translated = pandas.read_csv("mycoords.tsv")
 coord_tobe_translated.set_index("ENST")
 for transcript in GTF.transcript_iterator(GTF.iterator(open("my_gtf.gtf"))):
    converter = TranscriptCoordInterconverter(transcript)
    this_transcript_coords = coord_to_be_converted.loc[transcript[0].transcript_id]
    genome_coords = converter.transcript2genome(this_transcript_coords.position)
    for pos in genome_coords:
         print transcript[0].transcript_id, pos

Its a big rusty, written years ago, in python 2.7, but you get the idea. One of these days I'll get round to packing it up as a proper utility. . Presented "as is". No guarentees implied. Caveat emptor.

ADD COMMENT
0
Entering edit mode

Thank you very much, it is very useful!

ADD REPLY

Login before adding your answer.

Traffic: 1950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6