I have a really easy one for you today, and it's annoying me that I haven't found the answer myself yet.
My PI would like me to create a pie chart of the types of genomic locations that occur in the whole genome. For example, what percentage of the whole genome is intronic, exonic, intergenic, 5' UTR, etc. I'm wondering which file I would use to create this, and which tool? I'm thinking of some sort of BED file of the whole genome that I could then annotate with HOMER, but I'm not sure exactly which file and format to go with. I have to do the hg19 UCSC genome as well as the newest rat Rnor_6.0 Ensembl genome.
> library(dplyr)
> library(Homo.sapiens)
# Get the Human TxDb object, and restrict it to standard chromosomes (no random or Un chromosomes)
> Tx.human = TxDb.Hsapiens.UCSC.hg19.knownGene
> Tx.human = keepStandardChromosomes(Tx.human)
# Total number of bases in the human genome.
> tot.wholegenome = sum(as.numeric(seqlengths(Tx.human)))
[1] 3095693983
# Total bases covered by exons
> tot.exons = exons(Tx.human) %>%
    reduce %>%   # merge overlapping exons to avoid double-counting
    width %>%    # get width of each merged exon
    sum
[1] 85928932
Now you have both the total number of bases in the genome and the number of bases covered by exons. You can plot them with your plotting library of choice (e.g. ggplot2).
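As a minimal sketch of the plotting step, here is one way to turn the two totals printed above into percentages and a pie chart. It uses base R's `pie()` rather than ggplot2 only to stay dependency-free. The "non-exonic" slice is a placeholder I am assuming for everything that is not exon, and the chart title is arbitrary.

```r
# Totals copied from the console output above (hg19, standard chromosomes).
tot.wholegenome <- 3095693983
tot.exons       <- 85928932

# Everything not covered by an exon is lumped into one placeholder slice.
bases <- c(exonic = tot.exons, "non-exonic" = tot.wholegenome - tot.exons)
pct   <- 100 * bases / sum(bases)

# Label each slice with its name and percentage, then draw the chart.
slice.labels <- sprintf("%s (%.1f%%)", names(bases), pct)
pie(bases, labels = slice.labels, main = "hg19 genome composition")
```

With these numbers the exonic slice comes out a little under 3% of the genome; the remaining slice can be split further once introns and intergenic regions are counted.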
To get introns, intergenic regions, etc., just use genes(), cds(), and the other TxDb accessor functions (e.g. intronsByTranscript(), fiveUTRsByTranscript()), and intersect them.
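A rough sketch of how that could look for three categories, assuming a simple priority of exon over intron over intergenic. Here "intronic" means bases inside a transcript that no exon covers, and "intergenic" means bases outside every transcript. Strand is ignored, so the exonic total may differ slightly from a strand-aware `reduce`. The helper `covered()` is my own name, not part of any package.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Restrict to standard chromosomes (no random or Un contigs).
txdb <- keepStandardChromosomes(TxDb.Hsapiens.UCSC.hg19.knownGene)

# Hypothetical helper: number of bases covered by a set of ranges,
# merging overlaps and ignoring strand so each base is counted once.
covered <- function(gr) {
  sum(as.numeric(width(reduce(gr, ignore.strand = TRUE))))
}

total       <- sum(as.numeric(seqlengths(txdb)))
exonic      <- covered(exons(txdb))
transcribed <- covered(transcripts(txdb))

# Every exon lies inside a transcript, so the differences are non-negative.
bases <- c(exonic     = exonic,
           intronic   = transcribed - exonic,
           intergenic = total - transcribed)
pct <- 100 * bases / total
print(round(pct, 2))
```

Splitting the exonic slice into CDS and UTRs would follow the same pattern using cds(), fiveUTRsByTranscript() and threeUTRsByTranscript(), as long as you decide which category wins where they overlap.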
I'm just going to throw out an easy way to do this using the ChIPseeker R package from Bioconductor. You would first annotate your peaks with annotatePeak(), and then use the plotAnnoPie() function to get the pie chart automatically.
Thanks guys. I'm going to give Giovanni's R based solution a try later. I'm not that familiar with R but it looks easy and my colleague is going to help me. I'll let you know how it goes.
I'd download the GTF and use GenomicFeatures in R, but that's me.
No need to do that, just install the Homo.sapiens package from Bioconductor. It's the same data.