Question

Mapping Genome Coordinates From Ensembl Biomart Fasta Header To Sequence

13.2 years ago

Marcel Schilling ▴ 20

Hello everybody,

I queried a FASTA file containing all 3' UTRs of the current human genome assembly.
For the header line I chose GeneID, TranscriptID, 3' UTR start, 3' UTR end, strand and chromosome name.
Now I'm having problems assigning chromosome positions to each nucleotide:
AFAIK Ensembl coordinates are one-based and include both the given start and end position. Thus >ENSG007|ENST0815|100;150|120;159|1|X should have sequence length 31 (21 + 10).
I expected the start list and end list to be sorted and the exons to be non-overlapping. Thus for >ENSG...|ENST...|s_1;s_2;...;s_n|e_1;e_2;...;e_n|...|... s_i<=e_i<s_(i+1) should be true for i in [1..n].
But I encountered lines like the following:

...

ENSG00000026297|ENST00000028008|167343033;167360170;167356507;167344532;167352383;167347579|167343279;167360218;167356577;167344606;167352496;167347624|-1|6
...
ENSG00000011198|ENST00000013894|43740996;43743707;43743404;43753201|43741025;43744079;43743500;43753356|1|3
...
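
For reference, the expectation described above can be checked mechanically. The Python sketch below is not part of the original post, and its function names are made up for illustration. It parses a header in the field order given above, computes the UTR length under the one-based inclusive convention (each segment contributes end - start + 1 bases), and tests the ordering condition. It assumes that the i-th start belongs with the i-th end.

```python
def parse_header(header):
    """Split 'gene|transcript|starts|ends|strand|chromosome' into typed fields."""
    gene, transcript, starts, ends, strand, chrom = (
        header.lstrip(">").strip().split("|")
    )
    return {
        "gene": gene,
        "transcript": transcript,
        "starts": [int(s) for s in starts.split(";")],
        "ends": [int(e) for e in ends.split(";")],
        "strand": int(strand),
        "chrom": chrom,
    }


def utr_length(rec):
    # One-based, inclusive coordinates: a segment spans end - start + 1 bases.
    return sum(e - s + 1 for s, e in zip(rec["starts"], rec["ends"]))


def is_sorted_nonoverlapping(rec):
    # Checks s_i <= e_i for every segment and e_i < s_(i+1) between neighbours.
    pairs = list(zip(rec["starts"], rec["ends"]))
    within = all(s <= e for s, e in pairs)
    between = all(pairs[i][1] < pairs[i + 1][0] for i in range(len(pairs) - 1))
    return within and between


toy = parse_header(">ENSG007|ENST0815|100;150|120;159|1|X")
print(utr_length(toy))                # 21 + 10 = 31
print(is_sorted_nonoverlapping(toy))  # True
```

Applied to the two real headers above, `is_sorted_nonoverlapping` returns False for both, although each individual start is still less than or equal to the end at the same list position.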

How do I correctly assign a chromosome position to each nucleotide?

Thanks for your help, I might just be too confused at the moment...

Marcel

ensembl coordinates exon • 3.6k views


score 3 · Answer 1 · 2011-09-01

The coordinates are correct, but indeed not in the right order. Compare, for example, the BioMart output with what is shown on the website for ENST00000028008. I have mailed the person responsible for Ensembl BioMart, and she will contact the BioMart developers at OICR to see whether this can be fixed. In the meantime, I am afraid your only option is to sort the coordinates yourself.
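
As a rough illustration of sorting the coordinates yourself, here is a minimal Python sketch. It is not an official Ensembl or BioMart recipe, and the function name is hypothetical. It rests on two assumptions that you should verify against your own data: that the i-th start and the i-th end in a header describe the same segment (this holds for the two headers quoted in the question), and that the FASTA sequence is given in transcript orientation, so that on strand -1 the first base corresponds to the highest coordinate.

```python
def positions_from_header(header):
    """Return (chromosome, positions) with one position per base, in sequence order."""
    fields = header.lstrip(">").strip().split("|")
    starts = [int(x) for x in fields[2].split(";")]
    ends = [int(x) for x in fields[3].split(";")]
    strand, chrom = int(fields[4]), fields[5]

    if len(starts) != len(ends):
        raise ValueError("cannot pair starts with ends: counts differ")

    # Assumption: the i-th start and i-th end belong to the same segment.
    # Sorting the pairs puts the segments in ascending genomic order.
    segments = sorted(zip(starts, ends))

    positions = []
    for start, end in segments:
        # One-based inclusive coordinates, hence end + 1.
        positions.extend(range(start, end + 1))

    # Assumption: the sequence is in transcript orientation, so for the
    # reverse strand the first base sits at the highest coordinate.
    if strand == -1:
        positions.reverse()
    return chrom, positions


header = ("ENSG00000011198|ENST00000013894|"
          "43740996;43743707;43743404;43753201|"
          "43741025;43744079;43743500;43753356|1|3")
chrom, positions = positions_from_header(header)
print(chrom, positions[0], positions[-1], len(positions))
```

The length of `positions` should equal the length of the sequence under the header; if it does not, one of the assumptions above does not hold for that record.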

score 1 · Answer 2 · 2011-11-14

13.0 years ago

Bert Overduin ★ 3.7k

This is indeed confusing and will be brought up again with the BioMart developers at OICR. In the meantime, please either use the Ensembl browser to check any unclear cases or use the Ensembl API to retrieve your data instead of BioMart.


score 0 · Answer 3 · 2011-11-14

The ordering Bert suggested cannot always help.

Here is a case where, for three transcript IDs, there are three UTR starts but only two UTR ends:

gene: ENSMUSG00000000028
transcripts: ENSMUST00000096990;ENSMUST00000000028;ENSMUST00000115585
starts: 18781989;18780540;18780546
ends: 18781990;18780666
chromosome: 16
strand: -1

Which goes with which?
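
Records like this one can at least be detected before any sorting is attempted. The Python sketch below uses a hypothetical helper name and is only a sanity check, not a fix. It flags a record when several transcript IDs share one line or when the numbers of starts and ends differ, since in either case the starts cannot be paired with the ends unambiguously.

```python
def pairing_problems(transcripts, starts, ends):
    """Return a list of reasons why starts and ends cannot be paired safely."""
    problems = []
    if len(transcripts) > 1:
        problems.append(
            f"{len(transcripts)} transcript IDs share one record"
        )
    if len(starts) != len(ends):
        problems.append(
            f"{len(starts)} starts but {len(ends)} ends"
        )
    return problems


# The record quoted above, split on ';'.
transcripts = "ENSMUST00000096990;ENSMUST00000000028;ENSMUST00000115585".split(";")
starts = [int(x) for x in "18781989;18780540;18780546".split(";")]
ends = [int(x) for x in "18781990;18780666".split(";")]

for problem in pairing_problems(transcripts, starts, ends):
    print(problem)
```

Any record that produces a non-empty list is better looked up individually than guessed at.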

Unless this gets corrected somehow, maybe Ensembl should shut down this option - the results seem to be difficult to interpret.

Cheers,

Nenad Bartonicek
PhD student, Enright group
European Bioinformatics Institute
Hinxton, Cambridge CB10 1SD
United Kingdom