Question

ALEXA-Seq: gene expression above noise level

11.3 years ago

elmbeech ▴ 70

Lately I have been working with RNA-seq data from a breast cancer cell line panel that was generated with the ALEXA-seq pipeline.

I was fascinated by the binary expressed/not expressed (0/1) flag that is available for every gene. So I had a look at the paper 'Alternative expression analysis by RNA sequencing' and its supplementary information (Figures 5 and 6). The method used to decide whether a feature is expressed below or above intergenic and locus-specific (intragenic) noise is, as far as I understand, based on the measured expression levels of exon regions, silent intron regions, and silent intergenic regions.

I wonder if it is possible to adapt this method so that it can be used generically with any kind of RNA-seq pipeline.

The key question is whether a downloadable reference annotation (e.g. the Homo_sapiens.GRCh37.75.gtf.gz file at ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/) contains all of the mentioned kinds of genomic regions (exon, silent intron, and silent intergenic). And further, how can one distinguish between these genomic regions?

Any insight is welcome! Thank you,

Elmar

alexa-seq gene rna-seq genome • 3.0k views

updated 3.9 years ago by Ram 45k • written 11.3 years ago by elmbeech ▴ 70

Ram · Answer 1 · 2014-04-28

In ALEXA-seq, a work that has now arguably been superseded by newer tools, we tried to classify features as 'expressed above background noise levels' as follows (refer to the ALEXA-seq manuscript and supplementary materials for more details):

We identified thousands of negative control intergenic regions of varying size throughout the genome. These regions were defined by subtracting out known or predicted genes as well as regions with any evidence of expression from mRNAs in GenBank or ESTs in dbEST.
From the set of candidate negative controls, we chose a subset that is most representative of real genes with respect to size and GC content.
Using these as negative controls, we chose the 95th percentile of their expression values as an estimate of the background noise that you might see from any region regardless of whether it was really expressed, i.e. a cutoff that has a 'rationale' behind it.
For splicing analysis, the problem is more complex. Say you have some evidence for expression of an intron or novel exon within a known gene. This region may have the same level of noise as any region in the genome. However, it will also have additional noise from expression actually occurring at that locus. You will have unprocessed RNA in your sample that will increase noise in all introns. You will also have stochastic splicing errors. These sources of noise will be correlated with expression level. The more actively transcribed the region, the higher the noise levels. Thus a single cutoff for all loci is inadvisable. For that reason we again chose negative control features, within genes this time, that again have no prior evidence of being expressed in known databases. We then plotted the expression of these controls against expression of the gene they reside within (see Supplementary Figure 5 for an example). We then fit a linear model to that data and used it to derive a sliding background noise cutoff on a gene-by-gene basis. That way a novel exon within a highly expressed locus has to pass a higher bar to be considered real than one in a lowly expressed locus.
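To make the two cutoffs concrete, here is a minimal sketch in plain Python. It is not the ALEXA-seq code: the toy expression values, the helper names (`percentile`, `fit_line`, `is_expressed`), and the choice to use the fitted line directly as the locus-specific cutoff are all assumptions made for illustration. The actual pipeline may work on a different scale and derive the cutoff from the fit differently.

```python
import math

def percentile(values, pct):
    """Linear-interpolation percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical expression values of silent intergenic control regions.
intergenic_controls = [0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
# Global background cutoff: 95th percentile of the intergenic controls.
intergenic_cutoff = percentile(intergenic_controls, 95)

# Hypothetical (host gene expression, silent intragenic control expression) pairs.
intragenic_pairs = [(1.0, 0.2), (2.0, 0.4), (4.0, 0.8), (8.0, 1.6), (16.0, 3.2)]
slope, intercept = fit_line(
    [gene for gene, _ in intragenic_pairs],
    [control for _, control in intragenic_pairs],
)

def is_expressed(feature_expr, gene_expr):
    """Call a feature expressed if it beats both the global intergenic cutoff
    and the sliding cutoff predicted from its host gene's expression.
    Assumption: the fitted line itself is used as the locus-specific cutoff."""
    locus_cutoff = slope * gene_expr + intercept
    return feature_expr > max(intergenic_cutoff, locus_cutoff)

print(is_expressed(2.0, 4.0))   # True: moderately expressed locus, lower bar
print(is_expressed(2.0, 16.0))  # False: highly expressed locus, higher bar
```

The same feature-level value passes in the moderately expressed gene but fails in the highly expressed one, which is the 'higher bar' behaviour described above.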

If you want to dig into some of the code that implemented these concepts, including the code to generate Supplementary Figure 5, you can look at summarizeExpressionValues and at alternativeExpressionDatabase.

For a review of tools related to rna-seq expression and splicing analyses you might refer to these posts: