I want to create plots like the number of exons per gene, and the length of transcripts (such as this figure). My first choice would be to use R/Bioconductor and gtf/gff files, but I haven't found any good ways of doing this. Appreciate any suggestions!
I'd start by using GenomicFeatures to convert the GTF/GFF file into a TranscriptDB object. Let's say for example that you have human chromosome 21 GTF saved as the file chr21.gtf.
library(GenomicFeatures)
txdb <- makeTranscriptDbFromGFF(file = "~/Downloads/chr21.gtf", format = "gtf")
Now it is a case of working to understand the structure of the txdb object, so as you can figure out how to extract, aggregate and plot the data that it contains. To get you started, here's a histogram of exons per transcript something like panel Ab in your figure:
library(ggplot2)
# group the exons by transcript
exons <- exonsBy(txdb, by = "tx")
# make a data frame with transcripts and exon count
exons.tx <- data.frame(tx = 1:length(exons), exons = sapply(exons, length))
# plot
ggplot(exons.tx) + geom_histogram(aes(exons), fill = "blue") + theme_bw()
Thanks for the answer. I already tried this approach, but I don't think my gff-file is formatted correctly. I get the following message:
Warning message:
In .deduceExonRankings(exs, format = "gff") :
Infering Exon Rankings. If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
Error in unlist(mapply(.assignRankings, starts, strands)) :
error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in (function (starts, strands) :
Exon rank inference cannot accomodate trans-splicing.
Is this annotation a GFF3 file obtained from NCBI? We've had some problems recently with some of NCBI's annotation files, as the way they encode trans splicing is quite strange. I can't remember for sure, but this error message looks familiar and I think we encountered it looking at NCBI GFF3 files.
I think the problem is that I have genes located on different chromosomes/scaffolds. I am working on a non-model organism with a partly assembled genome. But what I did so far was just to read the gff-file as a tab-separated text file and subset on the word "exon". Then count the number of exons per gene.
The GenomeTools suite includes the ``gt stat`` command which can calculate a variety of length/count/score statistics. It does not generate plots per se, but a trivial Perl/Python script can convert the output to a tabular format that could easily be loaded into R for plotting.
Thanks, I'll check it out. I have used GenomeTools earlier and I remember there were some error messages with my gff-file, but maybe they can be sorted out.
My experience is that the GenomeTools GFF3 parser is very strict--in a good way. When I get an error from GenomeTools about my GFF3 file, >95% of the time it's a problem with the file.
Thanks for the answer. I already tried this approach, but I don't think my gff-file is formatted correctly. I get the following message:
Is this annotation a GFF3 file obtained from NCBI? We've had some problems recently with some of NCBI's annotation files, as the way they encode trans splicing is quite strange. I can't remember for sure, but this error message looks familiar and I think we encountered it looking at NCBI GFF3 files.
Note that your original question specifies GFF or GTF and my example uses GTF.
I think the problem is that I have genes located on different chromosomes/scaffolds. I am working on a non-model organism with a partly assembled genome. But what I did so far was just to read the gff-file as a tab-separated text file and subset on the word "exon". Then count the number of exons per gene.