Entering edit mode
5.8 years ago
Sergio MartÃnez Cuesta
▴
230
This is an attempt to convert a bed file containing regions of interest from genomic coordinates to transcript coordinates. If you have come across something related or know a better way of doing this, it would be great to know.
The regions of interest and transcripts in genomic coordinates:
head myregions.bed
chr1 826780 826810
chr1 956954 956984
chr1 956991 957021
chr1 963199 963229
chr1 965556 965586
chr1 999818 999848
chr1 1054882 1054912
chr1 1232003 1232033
chr1 1254976 1255006
chr1 1324883 1324913
head transcripts.bed
chr1 11869 14409 ENST00000456328.2 ENSG00000223972.5 +
chr1 12010 13670 ENST00000450305.2 ENSG00000223972.5 +
chr1 29554 31097 ENST00000473358.1 ENSG00000243485.5 +
chr1 30267 31109 ENST00000469289.1 ENSG00000243485.5 +
chr1 30366 30503 ENST00000607096.1 ENSG00000284332.1 +
chr1 52473 53312 ENST00000606857.1 ENSG00000268020.3 +
chr1 57598 64116 ENST00000642116.1 ENSG00000240361.2 +
chr1 62949 63887 ENST00000492842.2 ENSG00000240361.2 +
chr1 65419 71585 ENST00000641515.2 ENSG00000186092.6 +
chr1 69055 70108 ENST00000335137.4 ENSG00000186092.6 +
Obtain intersecting regions with transcripts:
bedtools intersect -a myregions.bed -b transcripts.bed -wb > myregions.transcripts.intersect.bed
head myregions.transcripts.intersect.bed
chr1 826780 826810 chr1 826206 827522 ENST00000473798.1 ENSG00000225880.5 -
chr1 826780 826810 chr1 825138 849592 ENST00000624927.3 ENSG00000228794.8 +
chr1 826780 826810 chr1 594308 827769 ENST00000635509.2 ENSG00000230021.9 -
chr1 826780 826810 chr1 594308 827796 ENST00000634337.2 ENSG00000230021.9 -
chr1 956954 956984 chr1 954426 959309 ENST00000487214.1 ENSG00000188976.10 -
chr1 956954 956984 chr1 944204 959290 ENST00000327044.6 ENSG00000188976.10 -
chr1 956954 956984 chr1 944205 958458 ENST00000477976.5 ENSG00000188976.10 -
chr1 956991 957021 chr1 954426 959309 ENST00000487214.1 ENSG00000188976.10 -
chr1 956991 957021 chr1 944204 959290 ENST00000327044.6 ENSG00000188976.10 -
chr1 956991 957021 chr1 944205 958458 ENST00000477976.5 ENSG00000188976.10 -
Set intersecting regions in transcript coordinates ...
s----region------e
S--------------transcript-----------------------E
- if intersection in + strand, start = s-S+1 and end = e-S+1
- if intersection in - strand, start = E-e+1 and end = E-s+1
... using python:
ifile = open("myregions.transcripts.intersect.bed", "r")
ilines = ifile.readlines()
ifile.close()
t_coords = []
for line in ilines:
fields=line.split()
transcript=fields[6]
strand=fields[8]
s=int(fields[1])
e=int(fields[2])
S=int(fields[4])
E=int(fields[5])
if strand == "+":
ns=str(s-S+1)
ne=str(e-S+1)
if strand == "-":
ns=str(E-e+1)
ne=str(E-s+1)
t_coords.append([transcript, ns, ne])
ofile = open("myregions.transcripts.intersect.transcriptcoords.bed", "w")
for i in t_coords:
ofile.write("\t".join(i) + "\n")
ofile.close()
Any ideas on how to improve this would be helpful.
You could do something like below with awk:
awk -v OFS="\t" '{ if (
$9=="-"
) }' myregions.transcripts.intersect.bedYour s,S,e,E etc can be easily obtained by considering appropriate columns.