I cannot figure out how to pull out the coding sequence from knownGeneMrna.
I know that sequences in knownGeneMrna contain UTRs, so what I am doing is taking CDS_start - tx_start from knownGene to find the start of the CDS relative to the beginning of the knownGeneMrna sequence. The columns in knownGene are:
{accession},{chrom},{strand},{tx_start},{tx_end},{CDS_start},{CDS_end}, etc.
The problem is that some transcripts are shorter than the offset! For instance, uc010nyq.2. Why is this, and what am I doing wrong? I have found other related posts, but none that address this point.
Thanks,
Jeremy
How did you find this?
Where is the problem?
The problem is that I'm misinterpreting something; probably to do with the contents of knownGeneMrna.
1558790-1551689 = 7101
I take this to mean (erroneously, I'm sure) that if I pull the sequence from knownGeneMrna, the coding sequence will begin 7101 bp from the start of the transcript.
The tx length for uc010nyq.2 from knownGeneMrna is 3317 bp, so this assumption cannot be right.
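One likely reason the genomic difference overshoots the transcript length: CDS_start - tx_start is measured on the chromosome, so it counts intron bases that are not present in the spliced mRNA. The rough Python sketch below counts only exonic bases instead, using the exonStarts and exonEnds columns of knownGene. The function names and the toy coordinates are made up for illustration (they are not the real values for uc010nyq.2). It assumes 0-based, half-open coordinates as in UCSC tables, and that the knownGeneMrna sequence is stored 5' to 3' in transcript orientation, so on the minus strand the 5' UTR is the exonic stretch to the right of CDS_end. Offsets could still be off by a few bases if the mRNA does not align gap-free to the genome.

```python
def parse_blocks(field):
    """Parse a UCSC-style comma-separated list such as '100,1000,5000,'."""
    return [int(x) for x in field.split(",") if x]


def cds_offsets_in_mrna(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Return (start, end) of the CDS as 0-based, half-open mRNA offsets."""

    def exonic_bases(lo, hi):
        # Number of exonic bases inside the genomic interval [lo, hi).
        return sum(
            max(0, min(e, hi) - max(s, lo))
            for s, e in zip(exon_starts, exon_ends)
        )

    tx_start, tx_end = exon_starts[0], exon_ends[-1]
    utr_left = exonic_bases(tx_start, cds_start)   # exonic bases left of the CDS
    utr_right = exonic_bases(cds_end, tx_end)      # exonic bases right of the CDS
    cds_len = exonic_bases(cds_start, cds_end)

    # On the minus strand the transcript is read right to left on the genome.
    five_prime_utr = utr_left if strand == "+" else utr_right
    return five_prime_utr, five_prime_utr + cds_len


# Toy transcript: three exons, 500 exonic bases in total.
starts = parse_blocks("100,1000,5000,")
ends = parse_blocks("200,1100,5300,")

# Genomic CDS_start - tx_start is 1050 - 100 = 950, longer than the mRNA itself.
plus = cds_offsets_in_mrna("+", 1050, 5100, starts, ends)
minus = cds_offsets_in_mrna("-", 1050, 5100, starts, ends)
print(plus, minus)
```

With offsets like these, the coding sequence would be mrna[start:end] on the knownGeneMrna string, under the same assumptions.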
Thanks Pierre. I could do it this way if I were pulling the coding sequences directly from the hg19 chromosome fasta files. This seems silly though because they are already assembled in knownGeneMrna. The problem is that the UTRs are included in these sequences and I want to remove the UTRs so as to be left with the coding sequence.
Below is the unabridged sequence for uc010nyq.2 taken from knownGeneMrna. The section bounded by [ (line 5) and ] (the penultimate line) is the coding sequence. The UTR offset to the start of the coding sequence is 327 characters. Where is this offset annotated, or is there another file that only has the CDS?
Please don't post your comments as a new answer.