Question

Why does the goseq tutorial calculate the median for gene length, not the sum?

10.1 years ago

Parham ★ 1.6k

Hi,

I wonder why, when calculating gene lengths, the goseq manual takes the median of the transcript lengths? It's under section 5.3. I would have expected the sum rather than the median. The manual can be found here:

http://www.bioconductor.org/packages/release/bioc/vignettes/goseq/inst/doc/goseq.pdf

Cheers!

median goseq • 2.5k views

Ram · Accepted Answer · 2015-04-29

The goal is to determine whether there's a bias by gene length. To do this, one needs to derive some sort of gene length measure. There are a few ways to do that:

Union gene model: The total non-redundant exonic length of a gene
Estimated length: Derived using expectation maximization, which gives an estimate of the expected gene length within each sample
Median transcript length: What's used here, which is just the median of the annotated transcript lengths.

If you actually summed the transcript lengths, as you suggest, you'd get odd results for genes with many isoforms. The result also wouldn't match any biologically plausible length (e.g., if a gene has 20 isoforms of ~1 kb each, summing would yield a length of 20 kb rather than a more reasonable ~1 kb estimate).
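
To make the arithmetic concrete, here is a rough sketch in plain Python (not goseq's own code, which is in R). The transcript lengths and exon coordinates are invented for illustration, and `union_exonic_length` is a hypothetical helper that assumes half-open `(start, end)` intervals on one strand.

```python
from statistics import median


def union_exonic_length(exons):
    """Total non-redundant length covered by (start, end) exon intervals.

    Hypothetical helper: intervals are assumed half-open, so an exon
    (0, 400) covers 400 bases. Overlapping exons are merged first.
    """
    total = 0
    current_start, current_end = None, None
    for start, end in sorted(exons):
        if current_end is None or start > current_end:
            # Close the previous merged block before starting a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching exon: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Made-up gene with 20 annotated isoforms, each roughly 1 kb long.
transcript_lengths = [1000 + 10 * i for i in range(20)]

summed_length = sum(transcript_lengths)     # grows with the isoform count
median_length = median(transcript_lengths)  # stays near a single transcript

# Made-up exons pooled across isoforms; the first two overlap.
pooled_exons = [(0, 400), (300, 700), (900, 1200)]
union_length = union_exonic_length(pooled_exons)

print(summed_length, median_length, union_length)
```

With these invented numbers the sum comes out around 20 kb while the median and the union length both stay near 1 kb, which is the point: the sum scales with how many isoforms happen to be annotated, not with how long the gene is.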