Question

Convert regions from genomic coordinates to transcript coordinates

0

Entering edit mode

6.5 years ago

Sergio Martínez Cuesta ▴ 230

This is an attempt to convert a bed file containing regions of interest from genomic coordinates to transcript coordinates. If you have come across something related or know a better way of doing this, it would be great to know.

The regions of interest and transcripts in genomic coordinates:

head myregions.bed
chr1    826780  826810
chr1    956954  956984
chr1    956991  957021
chr1    963199  963229
chr1    965556  965586
chr1    999818  999848
chr1    1054882 1054912
chr1    1232003 1232033
chr1    1254976 1255006
chr1    1324883 1324913

head transcripts.bed
chr1    11869   14409   ENST00000456328.2   ENSG00000223972.5   +
chr1    12010   13670   ENST00000450305.2   ENSG00000223972.5   +
chr1    29554   31097   ENST00000473358.1   ENSG00000243485.5   +
chr1    30267   31109   ENST00000469289.1   ENSG00000243485.5   +
chr1    30366   30503   ENST00000607096.1   ENSG00000284332.1   +
chr1    52473   53312   ENST00000606857.1   ENSG00000268020.3   +
chr1    57598   64116   ENST00000642116.1   ENSG00000240361.2   +
chr1    62949   63887   ENST00000492842.2   ENSG00000240361.2   +
chr1    65419   71585   ENST00000641515.2   ENSG00000186092.6   +
chr1    69055   70108   ENST00000335137.4   ENSG00000186092.6   +

Obtain intersecting regions with transcripts:

bedtools intersect -a myregions.bed -b transcripts.bed -wb > myregions.transcripts.intersect.bed

head myregions.transcripts.intersect.bed
chr1    826780  826810  chr1    826206  827522  ENST00000473798.1   ENSG00000225880.5   -
chr1    826780  826810  chr1    825138  849592  ENST00000624927.3   ENSG00000228794.8   +
chr1    826780  826810  chr1    594308  827769  ENST00000635509.2   ENSG00000230021.9   -
chr1    826780  826810  chr1    594308  827796  ENST00000634337.2   ENSG00000230021.9   -
chr1    956954  956984  chr1    954426  959309  ENST00000487214.1   ENSG00000188976.10  -
chr1    956954  956984  chr1    944204  959290  ENST00000327044.6   ENSG00000188976.10  -
chr1    956954  956984  chr1    944205  958458  ENST00000477976.5   ENSG00000188976.10  -
chr1    956991  957021  chr1    954426  959309  ENST00000487214.1   ENSG00000188976.10  -
chr1    956991  957021  chr1    944204  959290  ENST00000327044.6   ENSG00000188976.10  -
chr1    956991  957021  chr1    944205  958458  ENST00000477976.5   ENSG00000188976.10  -

Set intersecting regions in transcript coordinates ...

            s----region------e
S--------------transcript-----------------------E

- if intersection in + strand, start = s-S+1 and end = e-S+1
- if intersection in - strand, start = E-e+1 and end = E-s+1

... using python:

ifile = open("myregions.transcripts.intersect.bed", "r")
ilines = ifile.readlines()
ifile.close()

t_coords = []

for line in ilines:
  fields=line.split()
  transcript=fields[6]
  strand=fields[8]
  s=int(fields[1])
  e=int(fields[2])
  S=int(fields[4])
  E=int(fields[5])
  if strand == "+":
    ns=str(s-S+1)
    ne=str(e-S+1)
  if strand == "-":
    ns=str(E-e+1)
    ne=str(E-s+1)    
  t_coords.append([transcript, ns, ne])


ofile = open("myregions.transcripts.intersect.transcriptcoords.bed", "w")

for i in t_coords:
  ofile.write("\t".join(i) + "\n")

ofile.close()

Any ideas on how to improve this would be helpful.

bed transcript genome • 1.8k views

ADD COMMENT • link 6.5 years ago by Sergio Martínez Cuesta ▴ 230

0

Entering edit mode

You could do something like below with awk:

awk -v OFS="\t" '{ if ( $9=="-" ) }' myregions.transcripts.intersect.bed

Your s,S,e,E etc can be easily obtained by considering appropriate columns.

ADD REPLY • link 6.5 years ago by GouthamAtla 12k