Question

Extracting ENST ids from coordinates


3.5 years ago

graeme.thorn ▴ 110

I have a set of results from Whippet listing deltaPsi values for segments of genes (which may or may not be exons). I have the relevant Ensembl Gene IDs for the analysis, but I need to identify which specific isoforms contain the segments flagged as significantly different between conditions.

Is there a quick programmatic way of extracting all transcripts of a particular gene that contain a particular segment's coordinates?

biomart whippet • 1.6k views



You tagged this with biomart. Does that mean you have tried it but could not find what you were looking for? I ask because the problem you describe is the kind of task BioMart can help with.

Reply • 3.5 years ago by Hamid Ghaedi 3.3k


I tagged it biomart because I expect there is probably a solution using BioMart, but I'm not au fait enough with it (or with biomaRt, the R package) to extract what I need.

Reply • 3.5 years ago by graeme.thorn ▴ 110


I'm afraid BioMart is not the best way of doing it, as it's gene oriented. If you ask it to output Ensembl transcript stable IDs (ENSTs) for a given genomic region, BioMart will look for a gene overlapping this region and print all of that gene's transcripts, whether or not each one overlaps the region. You could, however, do it using the REST API and the overlap endpoint described here: https://rest.ensembl.org/documentation/info/overlap_region Here's an example: https://rest.ensembl.org/overlap/region/human/17:27630005-27630969?feature=transcript;content-type=application/json
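For anyone wanting to script this, here is a minimal Python sketch of the same overlap query. It is not from the Ensembl documentation: the function names are made up for illustration, and it assumes each returned record carries the transcript ID under "id" and the parent gene ID under "Parent", which you should check against the actual JSON. Note too that a transcript overlapping a region is not guaranteed to contain the segment as an exon, since the region may fall in an intron.

```python
import json
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"


def overlap_url(chrom, start, end, species="human", feature="transcript"):
    """Build the overlap/region URL for one genomic interval."""
    return (f"{REST_SERVER}/overlap/region/{species}/"
            f"{chrom}:{start}-{end}?feature={feature}")


def fetch_overlaps(url):
    """Query the REST server and return the decoded JSON list of features."""
    # JSON is requested through the Content-Type header rather than the URL.
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def transcript_ids(records, gene_id=None):
    """Collect transcript IDs, optionally keeping only one gene's transcripts.

    Assumption: each record has "id" (transcript) and "Parent" (gene) keys.
    """
    return [rec["id"] for rec in records
            if gene_id is None or rec.get("Parent") == gene_id]
```

Calling transcript_ids(fetch_overlaps(overlap_url("17", 27630005, 27630969))) would then list the transcripts overlapping the example region above; passing gene_id restricts the list to the gene you already have from Whippet.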

Reply • 3.5 years ago by Michal @Ensembl ▴ 270

score 1 · Accepted Answer · 2021-12-08

This may not be the most efficient way of doing this, but I extracted the annotated exons from the GTF feature file and wrote their locations and ENST IDs into a BED4 file, using grep, sed and awk. Similarly, I converted the Whippet segment files into a BED4 file, so the files were of the form

chr1 158831351 158831557 ENSG00000163563.1
...

(the .1 is the indicator for the Whippet segment) and

chr1 158831351 158831557 ENST00000368141
chr1 158831351 158831557 ENST00000491210
...

then ran bedtools intersect -wa -wb -a Whippet.bed -b ENST.bed to get

chr1 158831351 158831557 ENSG00000163563.1 chr1 158831351 158831557 ENST00000368141
chr1 158831351 158831557 ENSG00000163563.1 chr1 158831351 158831557 ENST00000491210
...

This is now in the right format for me to look up the affected transcripts in the Whippet PSI output.
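The two ends of that pipeline can also be sketched in Python rather than grep, sed and awk. This is an illustration, not the commands I actually ran: the function names are hypothetical, the add_chr and strip_version options are assumptions about how your GTF names chromosomes and transcripts, and the start coordinate is shifted by one on the assumption that you want standard 0-based BED intervals from the 1-based GTF.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in column 9 of a GTF line.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_exon_to_bed4(line, add_chr=False, strip_version=True):
    """Turn one GTF exon line into a BED4 line; return None for other lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "exon":
        return None
    match = TRANSCRIPT_ID.search(fields[8])
    if match is None:
        return None
    chrom = fields[0]
    if add_chr and not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: the Whippet BED uses chr-prefixed names
    transcript = match.group(1)
    if strip_version:
        transcript = transcript.split(".")[0]  # drop a trailing version such as ".8"
    start = int(fields[3]) - 1  # GTF is 1-based inclusive, BED is 0-based half-open
    end = int(fields[4])
    return f"{chrom}\t{start}\t{end}\t{transcript}"


def group_intersect(lines):
    """Map each Whippet segment name to the transcripts it intersects.

    Expects the 8-column output of bedtools intersect -wa -wb on two BED4 files.
    """
    by_segment = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 8:
            continue
        segment, transcript = fields[3], fields[7]
        if transcript not in by_segment[segment]:
            by_segment[segment].append(transcript)
    return dict(by_segment)
```

Feeding the intersect output shown above to group_intersect gives one entry per Whippet segment with its list of ENST IDs, which is convenient for joining back onto the PSI table. Bear in mind that bedtools intersect reports any overlap by default, so partial overlaps would need filtering if you only want exons that match the segment exactly.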