Plot statistics from gtf/gff file
2
4
Entering edit mode
10.2 years ago
jon.brate ▴ 310

Hi,

I want to create plots like the number of exons per gene, and the length of transcripts (such as this figure). My first choice would be to use R/Bioconductor and gtf/gff files, but I haven't found any good ways of doing this. Appreciate any suggestions!

gtf exons annotation gff plot • 12k views
ADD COMMENT
6
Entering edit mode
10.2 years ago
Neilfws 49k

I'd start by using GenomicFeatures to convert the GTF/GFF file into a TranscriptDB object. Let's say for example that you have human chromosome 21 GTF saved as the file chr21.gtf.

library(GenomicFeatures)
txdb <- makeTranscriptDbFromGFF(file = "~/Downloads/chr21.gtf", format = "gtf")

Now it is a case of working to understand the structure of the txdb object, so as you can figure out how to extract, aggregate and plot the data that it contains. To get you started, here's a histogram of exons per transcript something like panel Ab in your figure:

library(ggplot2)
# group the exons by transcript
exons <- exonsBy(txdb, by = "tx")
# make a data frame with transcripts and exon count
exons.tx <- data.frame(tx = 1:length(exons), exons = sapply(exons, length))
# plot
ggplot(exons.tx) + geom_histogram(aes(exons), fill = "blue") + theme_bw()

ADD COMMENT
0
Entering edit mode

Thanks for the answer. I already tried this approach, but I don't think my gff-file is formatted correctly. I get the following message:

Warning message:
In .deduceExonRankings(exs, format = "gff") :
  Infering Exon Rankings.  If this is not what you expected, then please be sure that you have provided a valid attribute for exonRankAttributeName
Error in unlist(mapply(.assignRankings, starts, strands)) :
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in (function (starts, strands)  :
  Exon rank inference cannot accomodate trans-splicing.
ADD REPLY
0
Entering edit mode

Is this annotation a GFF3 file obtained from NCBI? We've had some problems recently with some of NCBI's annotation files, as the way they encode trans splicing is quite strange. I can't remember for sure, but this error message looks familiar and I think we encountered it looking at NCBI GFF3 files.

ADD REPLY
0
Entering edit mode

Note that your original question specifies GFF or GTF and my example uses GTF.

ADD REPLY
0
Entering edit mode

I think the problem is that I have genes located on different chromosomes/scaffolds. I am working on a non-model organism with a partly assembled genome. But what I did so far was just to read the gff-file as a tab-separated text file and subset on the word "exon". Then count the number of exons per gene.

ADD REPLY
4
Entering edit mode
10.2 years ago

The GenomeTools suite includes the ``gt stat`` command which can calculate a variety of length/count/score statistics. It does not generate plots per se, but a trivial Perl/Python script can convert the output to a tabular format that could easily be loaded into R for plotting.

Here is an example invocation.

gt gff3 -sort -tidy ref_Amel_4.5_top_level.gff3 | gt stat -genelengthdistri -o amel-genelengthdist.txt

To see all available options, run gt stat -help.

ADD COMMENT
0
Entering edit mode

Thanks, I'll check it out. I have used GenomeTools earlier and I remember there were some error messages with my gff-file, but maybe they can be sorted out.

ADD REPLY
0
Entering edit mode

My experience is that the GenomeTools GFF3 parser is very strict--in a good way. When I get an error from GenomeTools about my GFF3 file, >95% of the time it's a problem with the file.

ADD REPLY

Login before adding your answer.

Traffic: 1663 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6