If I'm not mistaken (I haven't checked...), the following script should retrieve all the constitutive exons from UCSC/wgEncodeGencodeCompV27 (maybe not the best source of transcripts):
# remove any existing sqlite3 database
rm -f tmp.sqlite
# download genes, create one table exon/transcript
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/wgEncodeGencodeCompV27.txt.gz" |\
gunzip -c | awk -F '\t' 'BEGIN {printf("create table T1(exon TEXT,transcript text);\nBEGIN TRANSACTION;\n");}{nExons=int($9);split($10,starts,/,/);split($11,ends,/,/); for(i=1;i<=nExons;i++) printf("INSERT INTO T1(exon,transcript) VALUES(\"%s_%s_%s\",\"%s\");\n",$3,starts[i],ends[i],$2);}END{printf("COMMIT;\n");}' | sqlite3 tmp.sqlite
# download gene attributes, create one table gene/transcript
curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/wgEncodeGencodeAttrsV27.txt.gz" | gunzip -c | cut -f1,5 | sort | uniq | awk -F '\t' 'BEGIN {printf("create table T2(gene TEXT,transcript text);\nBEGIN TRANSACTION;\n");} {printf("INSERT INTO T2(gene,transcript) VALUES(\"%s\",\"%s\");\n",$1,$2);}END{printf("COMMIT;\n");}' | sqlite3 tmp.sqlite
# count number of transcripts per gene
sqlite3 -noheader tmp.sqlite 'select gene from T2;' | sort | uniq -c| awk 'BEGIN {printf("create table T3(gene TEXT,num INTEGER);\nBEGIN TRANSACTION;\n");}{printf("insert into T3(num,gene) values(%s,\"%s\");\n",$1,$2);}END{printf("COMMIT;\n");}' > tmp.sql && sqlite3 tmp.sqlite < tmp.sql && rm tmp.sql
# count number of transcripts per exon
sqlite3 -noheader tmp.sqlite 'select exon from T1;' | sort | uniq -c| awk 'BEGIN {printf("create table T4(exon TEXT,num INTEGER);\nBEGIN TRANSACTION;\n");}{printf("insert into T4(num,exon) values(%s,\"%s\");\n",$1,$2);}END{printf("COMMIT;\n");}' > tmp.sql && sqlite3 tmp.sqlite < tmp.sql && rm tmp.sql
# join everything
sqlite3 tmp.sqlite 'select distinct T1.exon,T2.gene,T4.num from T1,T2,T3,T4 where T1.transcript = T2.transcript and T2.gene = T3.gene and T4.exon=T1.exon and T4.num=T3.num; '
# cleanup
rm tmp.sqlite
output:
(...)
chr22_50605367_50605443|ENSG00000008735.13|2
chr22_50605561_50605734|ENSG00000008735.13|2
chr22_50605824_50605934|ENSG00000008735.13|2
(...)
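For readers unfamiliar with the table layout, here is a minimal sketch of what the first awk step does, run on a single made-up row. It assumes the usual UCSC genePred-with-bin column order (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...); the function name `exon_keys` and the identifiers `TX1`/`GENE1` are hypothetical.

```shell
# Turn genePred-like rows into one "chrom_start_end <TAB> transcript" line per exon.
# Column positions are an assumption based on the UCSC genePred layout with a leading bin column.
exon_keys() {
  awk -F '\t' '{
    n = int($9)               # exonCount
    split($10, starts, ",")   # exonStarts, comma-separated with a trailing comma
    split($11, ends, ",")     # exonEnds, same format
    for (i = 1; i <= n; i++)
      printf("%s_%s_%s\t%s\n", $3, starts[i], ends[i], $2)
  }'
}

# Made-up transcript with two exons
printf '0\tTX1\tchr1\t+\t100\t400\t100\t400\t2\t100,300,\t200,400,\t0\tGENE1\n' | exon_keys
```

The exon key is just chromosome plus coordinates, so two transcripts share an exon only when both boundaries match exactly.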
note to self: "constitutive exons" are exons that are consistently retained in the mature transcript after splicing
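Under that definition, the test the script applies is "the number of transcripts containing the exon equals the number of transcripts of the gene". A minimal awk-only sketch of the same idea on toy data follows; the function name `constitutive_exons` and all identifiers in the input are made up. Unlike the sqlite version, this sketch counts exons per gene, which is an assumption on my side about how exons shared by overlapping genes should be handled.

```shell
# Input: gene <TAB> transcript <TAB> exon, one line per exon of each transcript.
# Output: exon <TAB> gene <TAB> count, for exons present in every transcript of the gene.
constitutive_exons() {
  awk -F '\t' '
    # count each transcript once per gene
    !seenTx[$1 SUBSEP $2]++ { nTx[$1]++ }
    # count each transcript once per (gene, exon)
    !seenEx[$1 SUBSEP $3 SUBSEP $2]++ { nEx[$1 SUBSEP $3]++ }
    END {
      for (k in nEx) {
        split(k, a, SUBSEP)   # a[1] = gene, a[2] = exon
        if (nEx[k] == nTx[a[1]])
          printf("%s\t%s\t%d\n", a[2], a[1], nEx[k])
      }
    }' | sort
}

# Toy data: G1 has two transcripts, chr1_300_400 is only in T1
printf 'G1\tT1\tchr1_100_200\nG1\tT1\tchr1_300_400\nG1\tT2\tchr1_100_200\nG2\tT3\tchr2_10_20\n' | constitutive_exons
```

Note that a gene with a single annotated transcript reports all of its exons as constitutive, so the result depends heavily on the annotation release used.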
Neither scrib nor cd44 should be annotated as having constitutive exons. Are you finding something that shows otherwise?
My PI does :)
Yeah, I do understand, and in fact I totally agree. It really depends on which annotation you look at. In UCSC, GENCODE 24 is the displayed annotation, and there exon (35/37) seems to be alternative while all the others seem constitutive. When you look at the GTF imported from GENCODE 25, based on Ensembl 85 (http://jul2016.archive.ensembl.org/index.html), new transcripts are shown, and indeed that changes which exons are alternative or constitutive. Look here: https://github.com/ZheFrenchKitchen/pics/blob/master/SCRIB.png