Subset of genes with at least 1 intron larger than 5000 kbp
1
2
Entering edit mode
6.3 years ago
shania90.lk ▴ 30

Hi to the experts,

Can I please get your help in trying to extract out genes that have at least 1 intron larger than 5 kbp? I have the GTF file and the reference genome for this species. What I want to know is, how many genes there are in the genome with at least 1 intron that has a length > 5 kbp. I have a feeling that GRanges object from the GTF file could be of use but still was unable to figure it out. I'm not sure whether I am being clear enough with my question though. Please feel free to ask questions to kindly help me find an answer for mine...

Thanks heaps, Shani.

GTF intron_length R Bash genomicFeatures • 4.0k views
ADD COMMENT
2
Entering edit mode

You should have a look at this post on how to extract introns from a GTF file. From there on, you can simply use something like awk to filter out every intron longer than a certain length, and then extract the gene names from it. There is no need for R, everything can be done from the Unix command line. Try to play around a bit, and feel free to come back in case of questions.

ADD REPLY
1
Entering edit mode

AFAIK, largest intron in human is less than 1 mb. You were asking about 5mb (5000 kb) intron. What is the organism?

ADD REPLY
1
Entering edit mode

I'm betting on a typo (== 5000bp) ... ?

ADD REPLY
0
Entering edit mode

Yes, you are right.... 5 kbp it is... thanks for pointing that out... this is for the barley genome though...

ADD REPLY
13
Entering edit mode
6.3 years ago

This can be done in R with the library GenomicFeatures.

library(GenomicFeatures)

## make TxDb from GTF file 
txdb <- makeTxDbFromGFF('Homo_sapiens.GRCh38.93.gtf')

## get intron information
all.introns <- intronicParts(txdb)

## subset for introns greater than 5 kb (side-note: there are no introns longer than 5000 kbp in the human hg38 genome)
subset.introns <- all.introns[width(all.introns) > 5000]

## look at the number of unique gene names present
##(what i do is unlist some metadata... convert it to strings and then count the unique gene_name occurrences) I agree its a bit hard to read...

length(unique(as.character(unlist(mcols(subset.introns)$gene_id))))
19469

In summary there are 61660 annotated introns > 5000 bp in size. These are derived from 19469 genes.

ADD COMMENT

Login before adding your answer.

Traffic: 2577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6