Background
We observed some RNA-seq coverage in regions outside annotated genes; let's call them intergenic. This expression appears to be more pronounced in, or unique to, a particular condition.
Goal
Find the genomic regions whose coverage is higher than expected from random noise alone, along with a read count (expression) value for each. We are not looking to identify those regions at high resolution, but rather to get a broad overview in order to:
test whether or not there is a trend for more intergenic expression in some conditions;
intersect those expressed intergenic regions with other relevant genomic features.
Data
Paired-end, total RNA-seq; vertebrate species, not human.
Possible strategies
A, naïve:
- divide the genome into windows (size?)
- count reads per window
- remove regions containing genes +/- 5kb
- set background: randomly select X regions (e.g. 1000) with 100 permutations to estimate the background distribution. Define the cut-off as mean (or median) + 2*SD.
- Use cut-off to select intergenic regions with high expression. Merge those within 1kb.
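The steps of strategy A can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not a tested pipeline: it assumes per-window read counts have already been computed by some other tool, works on a single chromosome, uses 0-based half-open coordinates, and all function names and defaults are hypothetical.

```python
import numpy as np

def make_windows(chrom_len, size):
    """Tile [0, chrom_len) with fixed-size windows; returns an (n, 2) array."""
    starts = np.arange(0, chrom_len, size)
    ends = np.minimum(starts + size, chrom_len)
    return np.column_stack([starts, ends])

def intergenic_mask(windows, genes, flank=5000):
    """True for windows that do not overlap any gene extended by `flank` bp."""
    keep = np.ones(len(windows), dtype=bool)
    for g_start, g_end in genes:
        lo, hi = g_start - flank, g_end + flank
        # half-open interval overlap test against the flanked gene
        keep &= ~((windows[:, 0] < hi) & (windows[:, 1] > lo))
    return keep

def background_cutoff(counts, n_regions=1000, n_perm=100, n_sd=2.0, seed=0):
    """Mean + n_sd * SD of counts pooled over repeated random draws of windows."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n = min(n_regions, len(counts))
    draws = np.concatenate(
        [rng.choice(counts, size=n, replace=False) for _ in range(n_perm)]
    )
    return draws.mean() + n_sd * draws.std()

def merge_regions(regions, max_gap=1000):
    """Merge (start, end) regions separated by at most `max_gap` bp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]
```

A typical use would be to keep `windows[intergenic_mask(windows, genes)]`, compute the cut-off from the counts of those windows, select the windows above it, and pass their coordinates to `merge_regions`. Swapping `draws.mean()` for `np.median(draws)` gives the median-based variant of the cut-off.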
B, fancier, following a histone mark-style approach:
- Use csaw to calculate coverage using a sliding window (size?)
- remove bins containing genes +/- 5kb
- median coverage across those bins used to filter "expressed regions" (I could also use a permutation approach here)
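The median-based filter in strategy B could look something like the sketch below. It is written in plain NumPy rather than csaw, so it only illustrates the idea and does not reproduce csaw's own filtering functions; the function name, the `fold` parameter and the length normalisation are assumptions added for illustration.

```python
import numpy as np

def filter_by_median(counts, widths, fold=1.0):
    """Flag bins whose per-base coverage exceeds `fold` times the median.

    counts: read count per intergenic bin
    widths: bin width in bp (used to normalise bins of unequal size)
    Returns (boolean mask of "expressed" bins, threshold used).
    """
    density = np.asarray(counts, dtype=float) / np.asarray(widths, dtype=float)
    threshold = fold * np.median(density)
    return density > threshold, threshold
```

One thing to watch: if more than half of the intergenic bins have zero reads, the median is 0 and every non-empty bin passes the filter, so a minimum count or a permutation-based cut-off may be needed on top.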
Question(s)
- Do any of the above options sound reasonable for what I am trying to accomplish?
- Is there some detail missing?
For the window size I was thinking of using the average exon size, since using transcript size could lead to really large windows. Also, if the expression is "transcript-like" (short exons separated by variable-length introns), that could lead to large discrepancies in average coverage, and some regions might be missed.
Would DERfinder be of use to you?
https://academic.oup.com/biostatistics/article/15/3/413/223630
https://github.com/alyssafrazee/derfinder
It just might. It is a bit more evolved than what I had in mind, but it could give extra useful information. I'll look into it, cheers.