When examining the divisibility of total exon length of all UCSC knownGene transcripts (77614), 66.9% of them are not divisible, and even if I adjust for both UTRs and ignore all transcripts with cdsStart = cdsEnd (presumably non-coding), I still get 50% of non-divisible cDNAs.
Here's my script to evaluate divisibility:
#!/usr/bin/python
import fileinput
print "name", "length", "%3", "adj.", "adj%3"
# subset of columns: name, txStart, txEnd, cdsStart, cdsEnd, exonsStart, exonsEnd
for line in fileinput.input("7_columns.txt"):
name, txs, txe, cds, cde, es, ee = line.strip("\n").split("\t")
# exon starts and ends
exons = es.strip(",").split(",")
exone = ee.strip(",").split(",")
total_length = 0
if cds != cde:
# adjust = min(delta(txEnd, cdsEnd), delta(txEnd, last_exon_start) +
# min(delta(cdsStart, txStart), delta(first_exon_end, txEnd)) ;
# without these min()'s, applying adjust yields negative lengths
adjust = min(int(txe) - int(cde), int(txe) - int(exons[-1])) + min(int(cds) - int(txs), int(exone[0]) - int(txe))
else:
# this N helps filter output for non-coding entries
print "N",
adjust = 0
for e_index in range(len(exons)):
# sum up all exon lengths
total_length += int(exone[e_index]) - int(exons[e_index])
# adjust for UTRs
adjusted = total_length - adjust
if adjusted % 3 != 0:
print name, total_length, total_length % 3, adjusted, adjusted % 3
fileinput.close()
(I counted transcripts by piping output to grep -v N | wc -l
.)
I did read http://biostar.stackexchange.com/questions/7042/are-all-exon-nucleotide-sequence-counts-divisible-by-3-in-the-human-genome first.
I'd appreciate any hints, including things to try and change in the script.
As a directly related question - how can cdsStart be within the second exon? What is the meaning of the 1st exon in this case? Here's an example (cdsStart is 3bps up from the end of 2nd exon):
select name, strand, txstart, txend, cdsstart, cdsend, exonstarts, exonends from knowngene where name='uc001aau.2';
name | strand | txstart | txend | cdsstart | cdsend | exonstarts | exonends
------------+--------+---------+--------+----------+--------+-----------------------+-----------------------
uc001aau.2 | + | 323891 | 328580 | 324342 | 325605 | 323891,324287,324438, | 324060,324345,328580,
this would only be alarming if you could find an example where no splicing isoforms exist that are divisible by 3
Based on what Pierre has written below, it looks like UCSC transcripts themselves represent alternative splicing isoforms. How do I treat the "extra" nucleotides in non-divisible isoforms - just discard them? That doesn't seem right.