Calculating each gene's mean of intron length
2
0
Entering edit mode
5.2 years ago

Hi, I want to calculate the mean of intron length of each gene in mouse(mm10) or human(hg19 or hg38). The rationale of doing this is trying to categorize genes into 4 or 5 bins based on their intron length. As the previously published paper "https://www.ncbi.nlm.nih.gov/pubmed/21358643", in Table 1, they classified genes into 4 groups depending on their intron length(0-1kb, 1-10kb,10-100kb,>100kb) and count the gene number in each class. Could anyone give me any suggestion to do this?

Thank you for reading my message. If there is anything I should mention but I don't, please let me know. I am pretty much rooky in bioinformatic analysis, I ready to take any suggestion and opinion.

RNA-Seq rna-seq genome gene sequencing • 2.1k views
ADD COMMENT
0
Entering edit mode

Thank for you two give me such good solution. Give me some time, I will try it both. I am not familiar with JAVA script and perl but I will try my best.

Thank you all again, I even did not expect such fast respond.

Sincerely

ADD REPLY
2
Entering edit mode
5.2 years ago

Along with Pierre's script, if you have your own GTF file, you could use this script from leafcutter.

https://github.com/davidaknowles/leafcutter/blob/master/leafviz/gtf2leafcutter.pl

This will produce a file called <prefix>_all_introns.bed.gz which contains all introns and corresponding gene names in a bed format. You could simply use groupBy from bedtools to get whatever metric you need.

zcat _v2_all_introns.bed.gz | awk -v OFS="\t" '{ print $4,$3-$2}' | groupBy -g 1 -c 2 -o mean
ADD COMMENT
1
Entering edit mode
5.2 years ago

using BioAlcidaeJdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

I'm treating each intron of each transcript as a distinct entity.

$ wget -O - -q "http://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.gtf.gz" | gunzip -c |\
java -jar dist/bioalcidaejdk.jar -F GTF -e 'stream().forEach(G->println(G.getId()+"\t"+G.getGeneName()+"\t"+G.getTranscripts().stream().flatMap(T->T.getIntrons().stream()).mapToInt(I->I.getLengthOnReference()).average().orElse(-99) ));' 

ENSMUSG00000029836  Cbx3    2392.5
ENSMUSG00000029837  1700111E14Rik   11957.25
ENSMUSG00000029830  Svopl   3663.7948717948716
ENSMUSG00000005864  Cnga2   3509.5714285714284
ENSMUSG00000029831  Npvf    1478.0
ENSMUSG00000029832  Nfe2l3  8307.166666666666
ENSMUSG00000029833  Trim24  5817.84
ENSMUSG00000029838  Ptn 23120.14285714286
ENSMUSG00000030827  Fgf21   327.0
ENSMUSG00000030825  Hsd17b14    1523.090909090909
ENSMUSG00000030826  Bcat2   1856.8392857142858
ENSMUSG00000030823  9130019O22Rik   533.0
ENSMUSG00000030824  Nucb1   1569.7358490566037
ENSMUSG00000030822  Prr14   622.8695652173913
ENSMUSG00000005871  Apc 6496.068493150685
ENSMUSG00000029823  Luc7l2  5344.425
ENSMUSG00000029826  Zc3hav1 3994.5384615384614
ENSMUSG00000029821  Gsdme   6530.684210526316
ENSMUSG00000029822  Osbpl3  5717.4789915966385
ENSMUSG00000005873  Reep5   8005.071428571428
(...)
ADD COMMENT
0
Entering edit mode

Dear Pierre Lindenbaum, Since I am not a bioinformatician, if the following questions are really annoyed you, I think I should apologize first. If I understand your code right, you operate these code under Linux. The first line of code is to download the GTF file. And the second line is how you apply the Java to run the Jvarkit and the rest of is that how you calculate the intron length. But between these two lines, how to import GTF into the bioalcidaejdk? And if it is possible...is there anything like manuscript or protocol could let me do step by step?

Sorry about asking for such a stupid question, but I could not find anyone nearby to ask.

ADD REPLY
1
Entering edit mode

bioalcidaejdk reads the GTF on standard input.

ADD REPLY

Login before adding your answer.

Traffic: 1797 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6