Hello everybody,
I queried a FASTA file containing all 3' UTRs of the current human genome assembly.
For the header line I chose GeneID, TranscriptID, 3' UTR start, 3' UTR end, strand and chromosome name.
Now I'm having problems assigning chromosome positions to each nucleotide:
AFAIK Ensembl coordinates are one-based and inclusive of both the given start and end position.
Thus >ENSG007|ENST0815|100;150|120;159|1|X should have sequence length 31 (21 bases for 100..120 plus 10 bases for 150..159).
I expected the start list and end list to be sorted and the exons to be non-overlapping.
Thus for >ENSG...|ENST...|s_1;s_2;...;s_n|e_1;e_2;...;e_n|...|... the condition s_i <= e_i < s_(i+1) should be true for i in [1..n] (for i = n only s_n <= e_n applies).
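To make that expectation concrete, here is a minimal sketch of the check in Python. The function names parse_header and is_sorted_non_overlapping are my own (hypothetical), and the field order is simply the one I chose for my header line.

```python
def parse_header(header):
    """Split a 'GeneID|TranscriptID|starts|ends|strand|chromosome' header into typed fields."""
    gene, transcript, starts, ends, strand, chrom = header.lstrip(">").strip().split("|")
    return {
        "gene": gene,
        "transcript": transcript,
        "starts": [int(s) for s in starts.split(";")],
        "ends": [int(e) for e in ends.split(";")],
        "strand": int(strand),
        "chrom": chrom,
    }


def is_sorted_non_overlapping(starts, ends):
    """True if s_i <= e_i for every i and e_i < s_(i+1) for every consecutive pair."""
    if len(starts) != len(ends):
        return False
    if any(s > e for s, e in zip(starts, ends)):
        return False
    return all(e < s_next for e, s_next in zip(ends, starts[1:]))


fields = parse_header(">ENSG007|ENST0815|100;150|120;159|1|X")
print(is_sorted_non_overlapping(fields["starts"], fields["ends"]))  # True
```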
But I encountered lines like the following:
...
ENSG00000026297|ENST00000028008|167343033;167360170;167356507;167344532;167352383;167347579|167343279;167360218;167356577;167344606;167352496;167347624|-1|6
...
ENSG00000011198|ENST00000013894|43740996;43743707;43743404;43753201|43741025;43744079;43743500;43753356|1|3
...
How do I correctly assign a chromosome position to each nucleotide?
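For illustration, this is roughly what I would try, as a sketch under assumptions I have not confirmed: that the i-th start belongs to the i-th end even when the lists are unsorted (this seems to hold for the two headers above), that coordinates are one-based and inclusive, and that for strand -1 the FASTA sequence is given in transcript direction, so positions run from high to low. The function name utr_positions is hypothetical.

```python
def utr_positions(starts, ends, strand):
    """Return one chromosome position per nucleotide of the 3' UTR sequence.

    Assumptions (unconfirmed): starts[i] pairs with ends[i]; coordinates are
    one-based and inclusive; for strand -1 the sequence is reported in
    transcript direction, so positions are listed in descending order.
    """
    if len(starts) != len(ends):
        raise ValueError("start and end lists differ in length")
    positions = []
    # Pair by index first, then sort the (start, end) pairs by start.
    for s, e in sorted(zip(starts, ends)):
        positions.extend(range(s, e + 1))  # inclusive end
    if strand == -1:
        positions.reverse()
    return positions


print(utr_positions([20, 10], [21, 12], 1))   # [10, 11, 12, 20, 21]
print(utr_positions([20, 10], [21, 12], -1))  # [21, 20, 12, 11, 10]
```

If this is right, the number of positions returned should equal the length of the sequence under the header, which would at least be an easy sanity check.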
Thanks for your help, I might just be too confused at the moment...
Marcel