I have a top-level question regarding the GUIDEseq analysis pipeline: why is there such a large discrepancy between the number of reads assigned to on-/off-target sites and the number of reads in the input data? Given how the library is prepared, it seems that many more reads should be usable.
For example, in the original paper the EMX1 target had ~11k reads reported at on-/off-target sites out of ~3.3 million input reads - a usable rate of ~0.3%. Even allowing a 20x fudge factor for read consolidation and for the control sample, that still accounts for only ~6-7% of the reads.
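For concreteness, the back-of-the-envelope arithmetic (the counts are my reading of the paper's EMX1 numbers, so treat them as approximate):

```python
on_off_reads = 11_000      # reads assigned to on-/off-target sites (EMX1)
total_reads = 3_300_000    # total reads in the input data
rate = on_off_reads / total_reads
print(f"usable rate: {rate:.1%}")                   # ~0.3%
print(f"with a 20x fudge factor: {20 * rate:.1%}")  # ~6.7%
```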
If one looks at the SAM file generated by the pipeline, it's conceivable that upwards of 50% of the reads should be considered "real", if we require a "real" hit to contain the dsODN, align to a site with <=8 mismatches to the gRNA, and be observed more than a few times; a rough sketch of that filter follows.
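To make those criteria concrete, here is a minimal sketch of the filter I have in mind - not the pipeline's actual logic. It assumes pysam; the file paths are placeholders, the dsODN and EMX1 spacer sequences are written from memory of the original paper (double-check them), and the mismatch check is a naive Hamming distance at the read's alignment start, ignoring strand and bulges, rather than a proper site alignment.

```python
from collections import Counter
import pysam

DSODN = "GTTTAATTGAGTTGTCATATGTTAATAACGGTAT"  # 34-bp GUIDE-seq dsODN (from memory)
GRNA = "GAGTCCGAGCAGAAGAAGAA"                 # EMX1 spacer, PAM excluded
MAX_MISMATCHES = 8
MIN_READS_PER_SITE = 3  # "observed more than a few times"

def mismatches(window, grna=GRNA):
    """Naive Hamming distance between the gRNA and an equal-length genomic window."""
    if len(window) < len(grna):  # truncated fetch near a contig end
        return len(grna)
    return sum(a != b for a, b in zip(window.upper(), grna))

bam = pysam.AlignmentFile("sample.bam", "rb")  # placeholder paths
ref = pysam.FastaFile("genome.fa")

site_counts = Counter()
for read in bam:
    if read.is_unmapped or read.query_sequence is None:
        continue
    if DSODN[:16] not in read.query_sequence:  # criterion 1: carries the ODN tag
        continue
    window = ref.fetch(read.reference_name, read.reference_start,
                       read.reference_start + len(GRNA))
    if mismatches(window) <= MAX_MISMATCHES:   # criterion 2: <=8 mismatches to gRNA
        site_counts[(read.reference_name, read.reference_start)] += 1

# criterion 3: site observed more than a few times
real = {site: n for site, n in site_counts.items() if n >= MIN_READS_PER_SITE}
print(f"{sum(real.values())} reads pass at {len(real)} candidate sites")
```

A Hamming check like this would miss bulged off-targets, so if anything it should undercount "real" reads, which only sharpens the question.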
Are there any GUIDEseq experts here who could shed some light on this?
I'll add that there have since been improvements to the protocol (ref1, ref2), but the numbers quoted above remain more or less the same.