Hi scsc185,
During its streaming estimation phase, salmon partitions mapped fragments into range-factorized equivalence classes, which group together fragments that map to the same set of transcripts with approximately the same conditional probabilities.
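To make the idea concrete, here is a minimal, purely illustrative sketch (in Python) of the information such a class carries; salmon's internal data structures differ, but conceptually each class is a label (transcript set plus binned conditional probabilities) together with a count:

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical, simplified view of a range-factorized equivalence class;
# this is not salmon's actual representation.
@dataclass
class RangeFactorizedEqClass:
    transcript_ids: Tuple[int, ...]    # the set of transcripts these fragments map to
    cond_prob_bins: Tuple[float, ...]  # approximate (binned) conditional probabilities,
                                       # one per transcript in transcript_ids
    count: int                         # number of fragments falling into this class
```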
During offline estimation (the VBEM or EM, depending on the flags you pass to salmon), the counts of these equivalence classes of fragments, and the conditional probabilities that induced them, are treated as sufficient statistics, at which point a standard bootstrap procedure is run. Consider A(X) to be the estimation algorithm (e.g. VBEM or EM) run on the vector of range-factorized equivalence class counts X. Each bootstrap replicate is then generated by the standard bootstrap procedure: treating the count vector X as an empirical distribution, a new count vector X' is drawn from it with replacement (i.e. X' ~ X). The bootstrap estimate is then B' = A(X'); that is, the estimation algorithm is simply re-run on X'.

If you request 50 bootstrap samples, this is done 50 times. Each X' is drawn from the same empirical distribution as X, but will differ slightly from it. This lets salmon assess how sampling variability (which specific X' is drawn) affects the overall abundance estimates. Looking across the set of bootstrap replicates then gives an estimate of the variance (or uncertainty) in the point estimate of each transcript's abundance.
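In rough pseudocode terms, the procedure looks something like the Python sketch below. This is only a sketch of the idea described above, not salmon's code (salmon is implemented in C++); `estimate_abundances` stands in for the EM/VBEM routine A, and the exact resampling details inside salmon may differ.

```python
import numpy as np

def bootstrap_replicates(counts, estimate_abundances, num_replicates=50, seed=None):
    """Illustrative sketch of bootstrapping over equivalence-class counts.

    counts              : vector X of range-factorized equivalence class counts
    estimate_abundances : the estimation algorithm A (e.g. an EM/VBEM routine)
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    total = int(counts.sum())        # total number of mapped fragments
    probs = counts / counts.sum()    # treat X as an empirical distribution

    replicates = []
    for _ in range(num_replicates):
        # Draw X' ~ X with replacement: a multinomial draw of `total` fragments
        # over the equivalence classes, weighted by their original counts.
        resampled = rng.multinomial(total, probs)
        # Re-run the estimation algorithm on the resampled counts: B' = A(X').
        replicates.append(estimate_abundances(resampled))

    # The spread across replicates estimates the variance / uncertainty of
    # each transcript's abundance point estimate.
    return np.vstack(replicates)
```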
With respect to the specifics of your questions:
How do you sample read counts? Is it sampling the reads that come from a given transcript?
So, what is being sampled are counts for range-factorized equivalence classes. That is, counts are drawn according to the empirical distribution, so that they preserve the same pattern of multimapping and the same conditional probabilities as the original fragments. This works because the range-factorized equivalence class counts act as the sufficient statistics of the inference procedure, so one can bootstrap over them directly rather than taking the more traditional approach (as in e.g. IsoEM) of bootstrapping the reads themselves.
What's the role of EM algorithm here? What parameters are being maximized here?
Most probabilistic quantification tools maximize a likelihood that looks like L(θ; F, T), where θ is the parameter vector. You can view θ as a vector, equal in length to the number of annotated transcripts, whose entries are the estimated nucleotide fractions of each annotated transcript in the current sample. The estimation procedure seeks the θ that maximizes the likelihood of the observed data F (the set of sequenced fragments) given the set of annotated transcripts T. Since the range-factorized equivalence class counts are the sufficient statistics, we have that L(θ; F, T) ∝ L(θ; X, T). Thus, for each bootstrap replicate X', we seek to maximize L(θ; X', T).
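As a rough illustration of the kind of EM update involved, here is a toy Python version that works directly on equivalence-class counts. It is a sketch under simplifying assumptions, not salmon's optimizer: it ignores effective-length correction and the richer terms in salmon's model, and folds everything about a fragment's compatibility with a transcript into the class's conditional probabilities.

```python
import numpy as np

def em_over_eq_classes(eq_classes, num_transcripts, num_iters=1000):
    """Toy EM over equivalence-class counts (illustrative only).

    eq_classes: list of (transcript_ids, cond_probs, count) tuples, where
                cond_probs[i] approximates P(fragment | transcript_ids[i]).
    Returns an estimate of the nucleotide fractions θ (one entry per transcript).
    """
    theta = np.full(num_transcripts, 1.0 / num_transcripts)
    total = sum(count for _, _, count in eq_classes)

    for _ in range(num_iters):
        expected = np.zeros(num_transcripts)
        for tids, cond_probs, count in eq_classes:
            # E-step: allocate this class's fragments to its transcripts in
            # proportion to theta[t] * P(fragment | t).
            w = theta[list(tids)] * np.asarray(cond_probs)
            z = w.sum()
            if z > 0.0:
                expected[list(tids)] += count * (w / z)
        # M-step: renormalize the expected allocations to get the new theta.
        theta = expected / total

    return theta
```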
For the Gibbs sampling, salmon uses a modified version of the 2-step Gibbs chain explained in the MMSEQ paper. This is similar in spirit to the bootstrap, but it is a Gibbs sampler over the posterior distribution of the transcript abundance estimates.
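A heavily simplified Python sketch of such a 2-step chain is below. It alternates between (1) allocating each equivalence class's fragments to its transcripts given the current abundances and (2) resampling the abundances from a Dirichlet conditional. The prior value, thinning scheme, and other details here are illustrative assumptions, not salmon's actual settings or implementation.

```python
import numpy as np

def two_step_gibbs(eq_classes, num_transcripts, num_samples=100, thin=16,
                   prior=1e-3, seed=None):
    """Illustrative 2-step Gibbs chain over equivalence-class counts.

    eq_classes: list of (transcript_ids, cond_probs, count) tuples, as above.
    Returns an array of posterior samples of theta (one row per sample).
    """
    rng = np.random.default_rng(seed)
    theta = np.full(num_transcripts, 1.0 / num_transcripts)
    samples = []

    for it in range(num_samples * thin):
        counts = np.zeros(num_transcripts)
        for tids, cond_probs, count in eq_classes:
            w = theta[list(tids)] * np.asarray(cond_probs)
            z = w.sum()
            if z > 0.0:
                # Step 1: allocate this class's fragments among its transcripts.
                counts[list(tids)] += rng.multinomial(count, w / z)
        # Step 2: resample theta from its Dirichlet conditional, with a small
        # symmetric prior (an assumed value, for illustration only).
        theta = rng.dirichlet(counts + prior)
        if (it + 1) % thin == 0:
            samples.append(theta.copy())

    return np.vstack(samples)
```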