Question

Calculating each gene's mean of intron length

1

Entering edit mode

5.9 years ago

userlee7655 ▴ 10

Hi, I want to calculate the mean of intron length of each gene in mouse(mm10) or human(hg19 or hg38). The rationale of doing this is trying to categorize genes into 4 or 5 bins based on their intron length. As the previously published paper "https://www.ncbi.nlm.nih.gov/pubmed/21358643", in Table 1, they classified genes into 4 groups depending on their intron length(0-1kb, 1-10kb,10-100kb,>100kb) and count the gene number in each class. Could anyone give me any suggestion to do this?

Thank you for reading my message. If there is anything I should mention but I don't, please let me know. I am pretty much rooky in bioinformatic analysis, I ready to take any suggestion and opinion.

RNA-Seq rna-seq genome gene sequencing • 2.6k views

ADD COMMENT • link 5.9 years ago by userlee7655 ▴ 10

0

Entering edit mode

Thank for you two give me such good solution. Give me some time, I will try it both. I am not familiar with JAVA script and perl but I will try my best.

Thank you all again, I even did not expect such fast respond.

Sincerely

ADD REPLY • link 5.9 years ago by userlee7655 ▴ 10

score 3 · Answer 1 · 2019-09-04

Along with Pierre's script, if you have your own GTF file, you could use this script from leafcutter.

https://github.com/davidaknowles/leafcutter/blob/master/leafviz/gtf2leafcutter.pl

This will produce a file called <prefix>_all_introns.bed.gz which contains all introns and corresponding gene names in a bed format. You could simply use groupBy from bedtools to get whatever metric you need.

zcat _v2_all_introns.bed.gz | awk -v OFS="\t" '{ print $4,$3-$2}' | groupBy -g 1 -c 2 -o mean

score 1 · Answer 2 · 2019-09-04

using BioAlcidaeJdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

I'm treating each intron of each transcript as a distinct entity.

$ wget -O - -q "http://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz" | gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().forEach(G->println(G.getId()+"\t"+G.getGeneName()+"\t"+G.getTranscripts().stream().flatMap(T->T.getIntrons().stream()).mapToInt(I->I.getLengthOnReference()).average().orElse(-99) ));' 

ENSMUSG00000029836  Cbx3    2392.5
ENSMUSG00000029837  1700111E14Rik   11957.25
ENSMUSG00000029830  Svopl   3663.7948717948716
ENSMUSG00000005864  Cnga2   3509.5714285714284
ENSMUSG00000029831  Npvf    1478.0
ENSMUSG00000029832  Nfe2l3  8307.166666666666
ENSMUSG00000029833  Trim24  5817.84
ENSMUSG00000029838  Ptn 23120.14285714286
ENSMUSG00000030827  Fgf21   327.0
ENSMUSG00000030825  Hsd17b14    1523.090909090909
ENSMUSG00000030826  Bcat2   1856.8392857142858
ENSMUSG00000030823  9130019O22Rik   533.0
ENSMUSG00000030824  Nucb1   1569.7358490566037
ENSMUSG00000030822  Prr14   622.8695652173913
ENSMUSG00000005871  Apc 6496.068493150685
ENSMUSG00000029823  Luc7l2  5344.425
ENSMUSG00000029826  Zc3hav1 3994.5384615384614
ENSMUSG00000029821  Gsdme   6530.684210526316
ENSMUSG00000029822  Osbpl3  5717.4789915966385
ENSMUSG00000005873  Reep5   8005.071428571428
(...)