How does Salmon bootstrapping work?
2.7 years ago
scsc185 ▴ 80

I am trying to understand how salmon creates bootstrap samples. I read the original salmon paper, which says the following.

The bootstrap sampling process works by sampling (with replacement) counts for each equivalence class, and then re-running the offline inference procedure (either the EM or VBEM algorithm) for each bootstrap sample.

What I understand right now:

  1. Bootstrap means sampling with replacement.
  2. An equivalence class is a set of transcripts that a given read can possibly come from.
  3. Read counts are how many reads come from a specific transcript.
  4. The EM algorithm, in the context of HMMs.

But I am having trouble putting these concepts together to understand the sentence above from the paper. Specifically,

  1. How do you sample read counts? Is it sampling the reads that come from a given transcript?
  2. What's the role of the EM algorithm here? What parameters are being maximized here?

Thank you!

RNA-seq salmon bootstrapping

2.6 years ago
Rob 6.9k

Hi scsc185,

During its streaming estimation phase, salmon partitions mapped fragments into range-factorized equivalence classes, which group together fragments that map to the same set of transcripts with approximately the same conditional probabilities.

During offline estimation (the VBEM or EM, depending on the flags you pass to salmon), the counts of these equivalence classes of fragments, and the conditional probabilities that induced them, are treated as sufficient statistics, at which point a standard bootstrap procedure is run. Let A(X) denote the estimation algorithm (e.g. VBEM or EM) applied to the vector of range-factorized equivalence-class counts X. Each bootstrap replicate is generated by the standard bootstrap procedure: treating the count vector X as an empirical distribution, a new vector X' is drawn from it with replacement. The bootstrap estimate is then B' = A(X'); that is, the estimation algorithm is simply re-run on X'. If you request 50 bootstrap samples, this is done 50 times. Each newly drawn X' comes from the distribution defined by X, but will differ slightly from it. This lets salmon assess how sampling variability (which specific X' is drawn) affects the overall abundance estimates. Looking across the set of bootstrap replicates then yields an estimate of the variance (or uncertainty) in the point estimate of each transcript's abundance.
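
As a rough sketch of that loop (illustrative Python, not salmon's actual code; run_em here is a hypothetical stand-in for the offline inference step A):

    import numpy as np

    def bootstrap_abundances(eq_counts, run_em, num_boot=50, seed=0):
        # eq_counts: vector X of range-factorized equivalence-class counts
        # run_em:    callable standing in for A(X), e.g. an EM/VBEM routine
        rng = np.random.default_rng(seed)
        total = int(eq_counts.sum())              # total number of fragments
        probs = eq_counts / total                 # X as an empirical distribution
        replicates = []
        for _ in range(num_boot):
            x_prime = rng.multinomial(total, probs)  # draw X' ~ X with replacement
            replicates.append(run_em(x_prime))       # re-run inference: B' = A(X')
        return np.array(replicates)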

With respect to the specifics of your questions:

How do you sample read counts? Is it sampling the reads that come from a given transcript?

What is being sampled are counts for the range-factorized equivalence classes. That is, counts are drawn according to the empirical distribution, so that they have the same pattern of multimapping and the same conditional probabilities as the original fragments. This works because the range-factorized equivalence-class counts act as the sufficient statistics of the inference procedure, so one can bootstrap over those directly rather than taking the more traditional approach (as in e.g. IsoEM) of bootstrapping the actual reads themselves.

What's the role of the EM algorithm here? What parameters are being maximized here?

Most probabilistic quantification tools maximize a likelihood of the form L(θ; F, T), where θ are the parameters. You can view θ as a vector, equal in length to the number of annotated transcripts, giving the estimated nucleotide fraction of each annotated transcript in the current sample. The optimization seeks the θ that maximizes the likelihood of the observed data F (the set of sequenced fragments) given the set of annotated transcripts T. Since the range-factorized equivalence-class counts are sufficient statistics, we have L(θ; F, T) ∝ L(θ; X, T). Thus, for each bootstrap replicate X', we seek the θ that maximizes L(θ; X', T).
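
As a toy illustration of that optimization (a Python sketch under simplifying assumptions; salmon's actual EM/VBEM also uses the per-class conditional probabilities and effective lengths, which are omitted here):

    import numpy as np

    def toy_em(eq_labels, eq_counts, num_txps, num_iter=100):
        # eq_labels: list of tuples of transcript indices (the class labels)
        # eq_counts: the count vector X
        # returns theta, the estimated fraction of fragments per transcript
        theta = np.full(num_txps, 1.0 / num_txps)        # uniform starting point
        for _ in range(num_iter):
            expected = np.zeros(num_txps)
            for label, count in zip(eq_labels, eq_counts):
                idx = list(label)
                w = theta[idx] / theta[idx].sum()        # E-step: split each class's
                expected[idx] += count * w               # count among its transcripts
            theta = expected / expected.sum()            # M-step: renormalize
        return theta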

For Gibbs sampling, salmon uses a modified version of the two-step Gibbs chain described in the MMSEQ paper. This is similar in spirit to the bootstrap, but it is a Gibbs sampler over the posterior distribution of transcript abundance estimates.

I am not sure if I am understanding this correctly. An equivalence class is a list of reads that map to the same set of transcripts. Let's say, in class A, we have 5 reads. That means there are 5 reads that map to this set of transcripts, i.e. the read count is 5. But we have many classes, so we will have a count vector that includes all the classes. The items in a count vector correspond to the size of each equivalence class. During bootstrapping, we create a new count vector by sampling the old count vector with replacement. For every newly generated count vector X', EM is run to maximize L(θ; X', T), where θ measures the estimated nucleotide fraction of each transcript in the set of annotated transcripts T.

  1. What is the estimated nucleotide fraction of each transcript?
  2. What happens after θ is maximized? Do we use it to generate a new count vector? But I thought we would just directly sample from the old count vector with replacement...?

The range-factorized equivalence classes also account for the conditional probabilities with which reads arise from transcripts, but let's set that aside for the sake of exposition here. So your description is correct. An equivalence class can be characterized by a set of transcripts (often called the label) and a count. For example, if we observe 15 reads that map to transcripts A, B, and C, and 8 reads that map to C and D, we will have two equivalence classes:

eq_1 = {A, B, C} : 15
eq_2 = {C, D} : 8

where the set gives the label and the number after the colon gives the count of such reads. Since the labels of the equivalence classes remain fixed during all rounds of bootstrapping, we can think of our count vector as X = (15, 8). This can be seen as an empirical distribution from which we can sample with replacement. In practice, to generate X', we sample 23 times according to the categorical distribution (15/23, 8/23). Each such resampling yields a new vector X' on which we then seek to maximize L(θ; X', T) using the optimization algorithm.
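
In code, that resampling step for the toy example might look like this (a hypothetical Python sketch):

    import numpy as np

    rng = np.random.default_rng()
    X = np.array([15, 8])                    # counts for eq_1 = {A, B, C} and eq_2 = {C, D}
    # draw 23 fragments from the categorical distribution (15/23, 8/23)
    X_prime = rng.multinomial(X.sum(), X / X.sum())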

What is the estimated nucleotide fraction of each transcript?

This is a vector, of length equal to the number of annotated transcripts (say M), that sums to one. The interpretation is that if θ_i = p, then a fraction p of the sequenced nucleotides (roughly, a fraction p of the reads) came from annotated transcript i. This is the fundamental parameter of the statistical generative model. It can then be used to generate TPMs (TPM_i = 10^6 * (θ_i / l_i) / Σ_j (θ_j / l_j), where l_i is the effective length of transcript i), to assign estimated read counts, or to compute any number of related quantities used for analysis.
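
As a quick sketch of that conversion (Python; eff_len is an assumed vector of effective transcript lengths, not something defined above):

    import numpy as np

    def tpm_from_theta(theta, eff_len):
        # theta:   nucleotide fractions (length M, sums to one)
        # eff_len: effective length of each transcript
        rate = theta / eff_len             # per-nucleotide abundance rate
        return 1e6 * rate / rate.sum()     # TPM_i = 10^6 * (theta_i/l_i) / sum_j(theta_j/l_j)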

What happens after θ is maximized? Do we use it to generate a new count vector? But I thought we would just directly sample from the old count vector with replacement...?

As described above, after θ is maximized, it is used to determine an estimated read count for each transcript for this particular bootstrap sample. If we bootstrap say k times, we will have, for each transcript, a vector of k+1 counts — the initial point estimate of salmon, and one extra estimated count for each bootstrap replicate. We can then look at the variability of this length k+1 vector to estimate the uncertainty we have — due to the random sampling of sequencing and the statistical estimation to resolve multimapping reads — in the abundance of this transcript.
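
A minimal sketch of summarizing that variability (illustrative Python; in practice, downstream tools read salmon's inferential replicates for you):

    import numpy as np

    def bootstrap_uncertainty(point_estimate, boot_counts):
        # point_estimate: length-M vector of estimated counts from the main run
        # boot_counts:    k x M matrix, one row of estimated counts per replicate
        all_counts = np.vstack([point_estimate, boot_counts])   # (k+1) x M
        sd = all_counts.std(axis=0, ddof=1)                     # spread per transcript
        cv = sd / np.maximum(all_counts.mean(axis=0), 1e-8)     # relative uncertainty
        return sd, cv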

This can be used in several downstream applications. For example, swish uses these per-replicate estimates to enable a non-parametric test for differential expression that accounts not only for the biological variability of a transcript's expression within each condition, but also for this added uncertainty due to statistical inference. As shown in the swish paper itself, this can sometimes improve the results of differential expression testing at the transcript level.

This is a great explanation, thank you! It makes so much more sense now.

2.7 years ago
dsull ★ 7.0k

For the EM algorithm (and the objective function that is to be maximized), see: https://bioinfo.iric.ca/understanding-how-kallisto-works/

For bootstrapping, imagine each equivalence class (EC) is a different colored ball (e.g. EC 1 = red, EC 2 = blue, EC 3 = green, etc.) and you have 10 red balls, 4 blue balls, 3 green balls, etc. You basically sample your balls with replacement and then, with your bootstrap sample, you try to maximize the likelihood function.
