Mapping Genome Coordinates From Ensembl Biomart Fasta Header To Sequence
13.2 years ago

Hello everybody,

I queried a FASTA file containing all 3' UTRs of the current human genome assembly.
For the header line I chose GeneID, TranscriptID, 3' UTR start, 3' UTR end, strand and chromosome name.
Now I'm having problems assigning chromosome positions to each nucleotide:
AFAIK Ensembl coordinates are one-based and include both the given start and end positions. Thus >ENSG007|ENST0815|100;150|120;159|1|X should have sequence length 31 (21 + 10).
I expected the start list and end list to be sorted and the exons to be non-overlapping. Thus for >ENSG...|ENST...|s_1;s_2;...;s_n|e_1;e_2;...;e_n|...|... the condition s_i <= e_i < s_(i+1) should be true for i in [1..n].
But I encountered lines like the following:

...

ENSG00000026297|ENST00000028008|167343033;167360170;167356507;167344532;167352383;167347579|167343279;167360218;167356577;167344606;167352496;167347624|-1|6
...
ENSG00000011198|ENST00000013894|43740996;43743707;43743404;43753201|43741025;43744079;43743500;43753356|1|3
...

How do I assign the correct chromosome position to each nucleotide?

Thanks for your help, I might just be too confused at the moment...

Marcel

ensembl coordinates exon • 3.6k views
13.2 years ago
Bert Overduin ★ 3.7k

The coordinates are correct, but indeed not in the right order. Compare the output of BioMart for example with what is shown on the website for ENST00000028008. I have mailed the person responsible for Ensembl BioMart and she will contact the BioMart developers at OICR to see whether this can be fixed. In the meantime I am afraid the only option for you is to order the coordinates by yourself.
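Ordering the coordinates yourself could look roughly like the sketch below. It is only an illustration, not an official Ensembl or BioMart tool: the function names `parse_header` and `nucleotide_positions` are made up here, the field layout is the one described in the question (gene|transcript|starts|ends|strand|chromosome), and it assumes that the blocks do not overlap and that the FASTA sequence is written in transcript orientation, so that on strand -1 the first nucleotide sits at the highest coordinate.

```python
def parse_header(header):
    """Parse 'gene|transcript|starts|ends|strand|chrom' (layout assumed from the question)."""
    gene, transcript, starts, ends, strand, chrom = header.lstrip(">").strip().split("|")
    # Sort starts and ends independently; for non-overlapping blocks the
    # i-th smallest start belongs to the i-th smallest end.
    starts = sorted(int(s) for s in starts.split(";"))
    ends = sorted(int(e) for e in ends.split(";"))
    if len(starts) != len(ends):
        raise ValueError(f"{transcript}: {len(starts)} starts but {len(ends)} ends")
    blocks = list(zip(starts, ends))
    # Verify s_i <= e_i for every block and e_i < s_(i+1) between neighbours.
    for i, (s, e) in enumerate(blocks):
        if s > e:
            raise ValueError(f"{transcript}: block {s}-{e} has start after end")
        if i + 1 < len(blocks) and e >= blocks[i + 1][0]:
            raise ValueError(f"{transcript}: block {s}-{e} overlaps the next block")
    return gene, transcript, blocks, int(strand), chrom


def nucleotide_positions(blocks, strand):
    """Return one chromosome position per nucleotide (one-based, inclusive ends)."""
    positions = [p for s, e in blocks for p in range(s, e + 1)]
    if strand == -1:
        # Assumption: the sequence is reported 5'->3' on the transcript,
        # so the first base maps to the highest genomic coordinate.
        positions.reverse()
    return positions


header = ("ENSG00000026297|ENST00000028008|"
          "167343033;167360170;167356507;167344532;167352383;167347579|"
          "167343279;167360218;167356577;167344606;167352496;167347624|-1|6")
gene, transcript, blocks, strand, chrom = parse_header(header)
positions = nucleotide_positions(blocks, strand)
```

The length of `positions` should equal the length of the sequence under that header; if it does not, the record is worth checking by hand in the Ensembl browser.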

13.0 years ago
Bert Overduin ★ 3.7k

This is indeed confusing and will be brought up again with the BioMart developers at OICR. In the meantime, please either use the Ensembl browser to check any unclear cases or use the Ensembl API to retrieve your data instead of BioMart.

13.0 years ago

The ordering Bert suggested does not always help.

Here is a case where, for three transcript IDs, there are three UTR starts but only two UTR ends:

gene: ENSMUSG00000000028
transcripts: ENSMUST00000096990;ENSMUST00000000028;ENSMUST00000115585
starts: 18781989;18780540;18780546
ends: 18781990;18780666
chromosome: 16
strand: -1

Which goes with which?

Unless this gets corrected somehow, maybe Ensembl should shut down this option - the results seem to be difficult to interpret.
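Records like this cannot be repaired by sorting, but they can at least be flagged before they are used. The sketch below is a hypothetical helper (`pairing_problems` is a name invented for this example, not part of BioMart or the Ensembl API) that reports why the starts and ends of a record cannot be paired unambiguously; it does not attempt to resolve the ambiguity.

```python
def pairing_problems(transcripts, starts, ends):
    """Return a list of reasons why starts and ends cannot be paired safely."""
    problems = []
    if len(starts) != len(ends):
        problems.append(f"{len(starts)} starts but {len(ends)} ends")
    if len(transcripts) > 1:
        # Several transcripts in one record: the coordinate lists no longer
        # say which block belongs to which transcript.
        problems.append(f"{len(transcripts)} transcripts share one coordinate list")
    return problems


# The record quoted above, split on ';'.
transcripts = "ENSMUST00000096990;ENSMUST00000000028;ENSMUST00000115585".split(";")
starts = [int(s) for s in "18781989;18780540;18780546".split(";")]
ends = [int(e) for e in "18781990;18780666".split(";")]
problems = pairing_problems(transcripts, starts, ends)
```

An empty list means the record passed both checks; anything else is a candidate for a lookup in the Ensembl browser or through the Ensembl API.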

Cheers,

Nenad Bartonicek PhD student, Enright group European Bioinformatics Institute Hinxton Cambridge CB10 1SD United Kingdom
