Could someone help clarify how kallisto handles reads from intronic regions?
I have come across the following statements and am unsure which of them is correct:
"The pseudoalignment-to-transcriptome algorithms force intronic reads to map to spurious genes, resulting in hundreds false positive genes in each cell. "
"What you experience is an outcome of the way that kmer-based pseudoalignment works. A read is k-compatible with a target if all of the mappable k-mers from a read occur in that target. When you add the intergenic sequences then there might be k-mers that were not originally mappable, but now become mappable to the new intergenic sequences"
"In order to know which reads come from spliced as opposed to unspliced transcripts, we need to see whether the reads contain intronic sequences. Thus we need to include intronic sequences in the kallisto index"
I am trying to decide between STARsolo and kallisto. I will probably run both and compare the results. However, I am working with single-nucleus data and want to make sure I clearly understand how intronic reads are handled.
Additionally, I would appreciate any information on what kinds of transcripts one would expect from nuclei versus, say, the endoplasmic reticulum, how they differ in antisense reads and lncRNA content, and more generally what can and cannot be detected from single-nucleus versus single-cell samples. Any pointers on the significance of these differences would be very helpful.
Yes, these quotes are from the recent STARsolo preprint. Why don't you simply use either STARsolo itself, or Alevin with a combined exonic and intronic index and the full genome as a decoy? That should take care of these spurious mappings much better.
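To make the k-compatibility definition quoted above concrete, here is a toy sketch in Python. It is not kallisto's implementation: the sequences, the names `tx_A` and `intron_B`, the helper functions, and k=5 are all made up for illustration, and strand and reverse complements are ignored. It only shows why a read from an intron can land on an unrelated spliced transcript when the index contains no intronic sequence, and why adding the intron changes the result.

```python
from collections import defaultdict

K = 5  # toy k-mer length; real indices use a much larger k (kallisto's default is 31)

def kmers(seq, k=K):
    """Return all overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(targets, k=K):
    """Map each k-mer to the set of target names that contain it."""
    index = defaultdict(set)
    for name, seq in targets.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index

def pseudoalign(read, index, k=K):
    """Intersect the target sets of all *mappable* k-mers of the read.

    K-mers that are absent from the index are skipped, which is exactly
    what lets a single chance match decide the outcome.
    """
    compatible = None
    for km in kmers(read, k):
        hits = index.get(km)
        if not hits:
            continue  # k-mer not in the index, so it is not mappable
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()

# Hypothetical sequences: a spliced transcript of gene A and an intron of gene B
# that happen to share the low-complexity k-mer "AAAAA".
spliced_only = {"tx_A": "TTGCAAAAACGTCC"}
intron_b = "CCGGCATAAAAAGTCGACTG"
with_introns = {**spliced_only, "intron_B": intron_b}

read = intron_b[3:17]  # a read that truly originates from the intron of gene B

spurious = pseudoalign(read, build_index(spliced_only))
resolved = pseudoalign(read, build_index(with_introns))
print("spliced-only index:    ", spurious)
print("spliced + intron index:", resolved)
```

With the spliced-only index the only mappable k-mer is the shared one, so the read is reported as compatible with `tx_A`. Once the intron is in the index, every k-mer of the read is mappable and the intersection leaves only `intron_B`. A genome decoy, as mentioned above, addresses the same problem for sequence that is neither exonic nor intronic.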