You can obtain a .bed
file with gene-body coordinates and a .bed
file with exon coordinates from UCSC or Ensembl. After that you can use bedtools subtract
to subtract the exons from the gene bodies - this will give you an output file with all of the intron coordinates.
bedtools subtract -a gene_body.bed -b exons.bed -s > introns.bed
After that you can calculate the length and average the size across genes using whichever language you prefer. Personally, I'd perform this in R
:
Input format (introns.bed
):
chr1 11873 12226 DDX11L1.1 . +
chr1 12612 12720 DDX11L1.1 . +
chr1 13220 14408 DDX11L1.1 . +
I'd run the code:
R --vanilla
inFile <- read.table('introns.bed',header=F,as.is=T,sep='\t')
colnames(inFile) <- c('chr','start','stop','gene','score','strand')
inFile$intronLength <- inFile$stop - inFile$start
meanIntronLength <- aggregate(inFile$intronLength,list(inFile$gene),mean)
write.table(meanIntronLength,'output_meanIntronLength.txt',row.names=F,quote=F,sep='\t')
q()
This will read in the file, calculate the intron length, then aggregate via gene name by taking the mean of the intron lengths. The output file will be a two-column file with gene name and intron length.
This question appears to be identical to: Calculating each gene's mean of intron length. It is not a good practice to post the same question using two different accounts.