What is the effect of setting "--fragment-length" (-l) either too low or too high for Kallisto single-end quantification and how could this affect your conclusions?
I've seen some variations of this question on here, but not a clear explanation for what the fragment length argument (which is required for running Kallisto on single-end data) actually does and how it affects quantification.
Some background/motivation: As others have noted, it is sometimes difficult to determine what this length should be, especially for data that you didn't generate yourself. I'm wary of simply guessing a number, because I've seen quantification results change substantially when the fragment length changes, and I'm not sure how to interpret that. For example, in one experiment where there was ribosomal RNA contamination in an NEBNext library, using the Bioanalyzer-derived fragment size (~300 bp) gave an rRNA contamination of about 10% of the reads, but setting the fragment length to 1 gave about 30%, closer to what I got by mapping the reads with Bowtie. How is this happening?
The reason I tried -l 1 is because I found a paper using QuantSeq data which used a fragment length of 1 (-l 1 -s 1) on their data, but I can no longer find that reference unfortunately. What is the effect of setting the fragment length to 1 and could this lead to incorrect conclusions?
This may be of interest: How to choose parameters for kallisto single end mode?
Thanks, I've already seen this post (and many others), but it doesn't answer my question. I'm not asking how to choose, I'm asking what it does. How is fragment length actually used by the software, and what is the effect of choosing different fragment lengths to analyze the same data? I will try to find some time today to do a bit of analysis on this and post it here, but ideally, I'm hoping someone can explain the actual algorithmic use of the fragment length in Kallisto.
In single-end mode, in addition to determining how transcript length should be adjusted to arrive at the effective length, the --fragment-length parameter determines which pseudoalignments should be allowed and which should be discarded. The -l and -s parameters define the expected distribution of fragment lengths, with the value passed to -l being the mean of this distribution. Let μ be the user-provided mean of this distribution. When mapping reads in single-end mode, kallisto will discard alignments where the mapping position of the read is < μ bases from the end of the transcript. In this case, the model says that the probability of choosing a fragment of length < μ is < 1/2, and so this alignment is discarded from consideration. If all pseudoalignments of the read map in a manner like this, the read goes unmapped. This means that the value provided to this parameter affects not only the effective transcript lengths, but the actual pseudoalignments considered during quantification.
I think this is right, though I don't quite understand the part about the probability of choosing a fragment of length < μ being < 1/2. I'm thinking that they must calculate the probability of the read matching the transcript based on both the -l and the -s parameters and some kind of Gaussian-derived probability? But you are definitely right that the -l parameter is not only about the effective length, but also about the pseudoalignment! I wasn't sure about this, which is part of why I asked the question. I've now done some empirical testing, which I'll post.
-l and -s are both required to generate a Gaussian distribution, which kallisto then truncates (e.g. we get rid of everything in the distribution over 1000 bp long) before calculating the mean of the truncated distribution.
You can disable the feature where kallisto discards pseudoalignments based on fragment lengths (it's disabled in "kallisto bus" mode for bustools and it can be disabled in "kallisto quant" via the --single-overhang option).
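The discard rule described in this thread can be sketched as follows. This is a simplified illustration in Python, not kallisto's implementation: the function keep_alignment and its arguments are hypothetical names, strand handling is ignored, and the single_overhang argument only mimics the effect of switching the filter off.

```python
def keep_alignment(transcript_len, read_start, mean_frag_len, single_overhang=False):
    """Return True if a single-end pseudoalignment would be kept.

    Hypothetical sketch of the rule described above: a read is dropped from
    a transcript when fewer than mean_frag_len bases remain between the
    read's start position and the end of the transcript.
    """
    if single_overhang:
        # Filter disabled: every pseudoalignment is kept.
        return True
    bases_to_end = transcript_len - read_start
    return bases_to_end >= mean_frag_len

# A read far from the 3' end of a 2000 bp transcript is kept with -l 300.
print(keep_alignment(2000, 100, 300))    # True
# A read starting 200 bp from the end is discarded with -l 300 ...
print(keep_alignment(2000, 1800, 300))   # False
# ... but kept with -l 1, or with the filter switched off.
print(keep_alignment(2000, 1800, 1))     # True
print(keep_alignment(2000, 1800, 300, single_overhang=True))  # True
```

One consequence of the rule as sketched here is that a transcript shorter than the mean fragment length cannot keep any read at all, while -l 1 keeps every position. Whether that accounts for a particular difference in mapped reads would need to be checked against the actual data.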
dsull may be able to address this specific request. That sort of detail may require a dive into code (if you are able to).