Hi guys,
I'm working with a mutant that is supposed to show a gene-specific splicing defect (1), readthrough (2) and spurious transcription (3). From paired-end, strand-specific RNA-seq data, what would you do to identify the affected loci? How would you present the results (in terms of figures)?
I was thinking of running DESeq on introns (1) and on 5'/3' flanking sequences (2), while keeping the normalization based on the exon read counts. Or maybe just use the ratio (reads mapping to introns)/(reads mapping to exons).
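As a rough sketch of the ratio idea (not a replacement for DESeq), the intron/exon ratio could be computed per gene as a length-normalised read density and then compared between mutant and wild type. The function names, the pseudocount, and the (intron reads, exon reads, intron length, exon length) input layout are assumptions made for illustration only.

```python
import math


def intron_exon_ratio(intron_reads, exon_reads, intron_len, exon_len,
                      pseudocount=1.0):
    """Intron/exon read-density ratio for one gene.

    Lengths are in bp. The pseudocount (an assumption, not from the thread)
    avoids division by zero for genes with no exonic reads.
    """
    intron_density = (intron_reads + pseudocount) * 1000.0 / intron_len
    exon_density = (exon_reads + pseudocount) * 1000.0 / exon_len
    return intron_density / exon_density


def log2_ratio_change(mutant, wild_type, pseudocount=1.0):
    """log2 of (mutant intron/exon ratio) / (wild-type intron/exon ratio).

    Each argument is a tuple:
    (intron_reads, exon_reads, intron_len, exon_len).
    Positive values suggest more intronic signal in the mutant.
    """
    m = intron_exon_ratio(*mutant, pseudocount=pseudocount)
    w = intron_exon_ratio(*wild_type, pseudocount=pseudocount)
    return math.log2(m / w)
```

Because the ratio is taken within each sample, differences in library size largely cancel out, but genes with very few exonic reads will give noisy values and probably need a minimum-coverage filter.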
What do you think of this approach?
I'd really appreciate any input on this! Thanks in advance!
Carlo
This sounds like a reasonable approach. I had an issue where an RNA-seq protocol seemed to deplete coverage of a small number of genes, but only over the coding sequence (approximately). I was able to come up with a good list of affected genes just by comparing my BAM files to a customized GFF file with separate "gene" annotations for 5' UTR, CDS, and 3' UTR, using bedtools intersect. The ratios of reads in these three feature categories, when matched up per gene, were sufficient to find affected genes (assuming I didn't miss a whole class).
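A minimal sketch of the per-gene matching step, assuming the counts come from something like `bedtools intersect -a features.gff -b reads.bam -c` (the file names are placeholders), which appends a read count as the last column of each GFF line. The feature type names (`five_prime_UTR`, `CDS`, `three_prime_UTR`) and the `Parent` attribute used as the gene ID are assumptions about the custom GFF, not details from the post above.

```python
from collections import defaultdict


def per_gene_counts(lines, id_key="Parent"):
    """Sum read counts per gene and per feature type.

    `lines` are tab-separated GFF lines with one extra trailing column
    holding the read count. `id_key` names the GFF attribute assumed to
    carry the gene ID.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF comments
        fields = line.rstrip("\n").split("\t")
        feature, attrs, n_reads = fields[2], fields[8], int(fields[-1])
        attr_map = dict(
            kv.strip().split("=", 1) for kv in attrs.split(";") if "=" in kv
        )
        gene = attr_map.get(id_key)
        if gene is not None:
            counts[gene][feature] += n_reads
    return counts


def cds_to_utr_ratio(gene_counts, pseudocount=1.0):
    """Reads in the CDS relative to reads in both UTRs for one gene."""
    utr = (gene_counts.get("five_prime_UTR", 0)
           + gene_counts.get("three_prime_UTR", 0))
    return (gene_counts.get("CDS", 0) + pseudocount) / (utr + pseudocount)
```

Raw counts are used here for simplicity; dividing each count by the feature length first would make genes with very different UTR and CDS sizes more comparable.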
I wonder if Cufflinks might also be effective, through its prediction of novel transcripts? You might expect a "spike" in novel transcript predictions, or in FPKM for those novel transcripts. This is probably overkill (computation time-wise).
Thanks for your input and suggestion! Once you had the ratios, did you just use a threshold to separate truly affected transcripts from noise?
I'll look into Cufflinks too.
I just filtered for a minimum read count per kb of my 5'UTR, then sorted by ratio. I bet a similar approach would work well.
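The filter-then-sort step could look roughly like the sketch below. The input layout (a dict of per-gene read counts and feature lengths), the function name, and the default cutoff of 10 reads per kb are all illustrative assumptions; the actual cutoff used is not stated in the thread.

```python
def rank_by_ratio(genes, min_utr5_reads_per_kb=10.0):
    """Filter genes on 5' UTR coverage, then sort by CDS/5' UTR density ratio.

    `genes` maps gene ID -> dict with keys utr5_reads, utr5_len,
    cds_reads, cds_len (lengths in bp). Returns (gene, ratio) pairs,
    lowest ratio (strongest CDS depletion) first.
    """
    ranked = []
    for gene, g in genes.items():
        if g["utr5_len"] <= 0 or g["cds_len"] <= 0:
            continue  # cannot compute a density without a length
        utr5_density = g["utr5_reads"] * 1000.0 / g["utr5_len"]
        if utr5_density < min_utr5_reads_per_kb:
            continue  # too little 5' UTR coverage to trust the ratio
        cds_density = g["cds_reads"] * 1000.0 / g["cds_len"]
        ranked.append((gene, cds_density / utr5_density))
    ranked.sort(key=lambda pair: pair[1])
    return ranked
```

Sorting rather than applying a hard threshold on the ratio leaves the cutoff to be chosen by inspecting the top of the list.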