The following awk script extracts the size of the first intron, expressed as a fraction of the gene length (txEnd - txStart):
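# Input: tab-separated rows of the UCSC knownGene table.
BEGIN {
    FS="\t";
}
{
    # $9 and $10 hold the comma-separated exon starts and exon ends.
    split($9,exonStarts,",");
    split($10,exonEnds,",");
    # Gene length = txEnd ($5) - txStart ($4), forced to floating point.
    geneSize=1.0*(int($5)-int($4));
    exonCount=int($8);
    # A transcript with a single exon has no intron: skip it.
    if(exonCount<2)
    {
        next;
    }
    if($3=="+")
    {
        # Forward strand: the first intron lies between exons 1 and 2.
        printf("%f\t%s\n",(exonStarts[2]-exonEnds[1])/geneSize,$0);
    }
    else
    {
        # Reverse strand: the first intron lies between the last two listed exons.
        printf("%f\t%s\n",(exonStarts[exonCount]-exonEnds[exonCount-1])/geneSize,$0);
    }
}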
Example:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select * from knownGene' -N | awk -f script.awk
0.151814 uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
0.144716 uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
0.164826 uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
0.276321 uc009vis.2 chr1 - 14362 16765 14362 14362 4 14362,14969,15795,16606, 14829,15038,15942,16765, uc009vis.2
0.101167 uc009vit.2 chr1 - 14362 19759 14362 14362 9 14362,14969,15795,16606,16857,17232,17914,18267,18912, 14829,15038,15947,16765,17055,17742,18061,18366,19759, uc009vit.2
0.101167 uc001aae.3 chr1 - 14362 19759 14362 14362 10 14362,14969,15795,16606,16857,17232,17605,17914,18267,18912, 14829,15038,15947,16765,17055,17368,17742,18061,18366,19759, uc001aae.3
0.066333 uc009viu.2 chr1 - 14362 19759 14362 14362 10 14362,14969,15795,16606,16857,17232,17914,18267,18500,18912, 14829,15038,15947,16765,17055,17742,18061,18369,18554,19759, uc009viu.2
0.603283 uc001aab.3 chr1 - 14362 24901 14362 14362 10 14362,14969,15795,16606,16853,17232,17605,17914,18267,24737, 14829,15038,15947,16765,17055,17368,17742,18061,18379,24901, uc001aab.3
0.295109 uc001aah.3 chr1 - 14362 29370 14362 14362 11 14362,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, uc001aah.3
0.295109 uc009vir.2 chr1 - 14362 29370 14362 14362 10 14362,14969,15795,16606,16857,17232,17914,18267,24737,29320, 14829,15038,15947,16765,17055,17742,18061,18366,24891,29370, uc009vir.2
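To cross-check these numbers outside awk, here is a minimal Python sketch of the same arithmetic. The function name `first_intron` and the two sample rows are only illustrative; the rows were retyped from the output above with tabs restored and the empty proteinID column assumed. It also returns the intron and gene lengths in bp, which the awk script does not print.

```python
def first_intron(row):
    """Return (intron_bp, gene_bp, fraction) for one tab-separated
    knownGene row, or None if the transcript has fewer than 2 exons."""
    f = row.rstrip("\n").split("\t")
    strand = f[2]
    tx_start, tx_end = int(f[3]), int(f[4])
    exon_count = int(f[7])
    # The exon lists end with a trailing comma, so drop empty items.
    starts = [int(x) for x in f[8].split(",") if x]
    ends = [int(x) for x in f[9].split(",") if x]
    if exon_count < 2:
        return None
    if strand == "+":
        # Forward strand: gap between exon 1 and exon 2.
        intron = starts[1] - ends[0]
    else:
        # Reverse strand: gap between the last two listed exons.
        intron = starts[exon_count - 1] - ends[exon_count - 2]
    gene = tx_end - tx_start
    return intron, gene, intron / gene


# Sample rows (assumed layout: 12 tab-separated knownGene columns).
PLUS_ROW = ("uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t"
            "11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3")
MINUS_ROW = ("uc009vis.2\tchr1\t-\t14362\t16765\t14362\t14362\t4\t"
             "14362,14969,15795,16606,\t14829,15038,15942,16765,\t\tuc009vis.2")

for r in (PLUS_ROW, MINUS_ROW):
    intron_bp, gene_bp, frac = first_intron(r)
    print("%d\t%d\t%f" % (intron_bp, gene_bp, frac))
# prints:
# 385   2536   0.151814
# 664   2403   0.276321
```

The two fractions agree with the first and fourth lines of the awk output. Because the awk script prepends the ratio to each row, txStart and txEnd sit in columns 5 and 6 of its output, so the gene length can be recovered there as column 6 minus column 5.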
AWESOME! Thanks Pierre. Is there a way to get the gene length from the output? I will take a look at the description of the knownGene table to understand the columns. Thanks for your answer.
Oh yeah, I see it from your script! That's fine. Thank you.
Pierre, do you mind if I share your solution at biocoders.net as a small snippet tutorial?