Salmon ~ Effective Length
6 months ago
chrisk • 0

Hello Biostars community,

A question regarding the effect of the fragment length distribution on Salmon's EffectiveLength computation (and hence TPM), given our situation below:

  • 150 bp sequencing
  • paired end library
  • 90 percent of reads on genes are overlapping
  • negative inner distance reported by RSeQC (-150 bp to -100 bp)

In most sequencing libraries, fragmentation and sequencing are set up to avoid overlapping PE reads, whereas ours obviously is not.

How do these two different scenarios affect the fragment length distribution calculation (assuming it's impossible to calculate insert size when paired-end reads do not overlap in an RNA library)?

Thanks in advance, Chris

Genomax

assuming it's impossible to calculate insert size when paired-end reads do not overlap in an RNA library

If you have a reference available, then paired-end sequencing allows one to estimate the length of the library fragment being sequenced by inferring how far apart the two reads map/align on the reference. One of the exceptions would be if the library fragment captured a breakpoint (e.g. the two ends map to two different chromosomes). In that case it is not possible to estimate the insert size.

90 percent of reads on genes are overlapping

You have a "short" insert library. There is no solution for this specific issue except making a new prep/library.

6 months ago
Rob 6.9k

Salmon will compute the fragment length distribution based on the (probabilistically weighted) implied distance between the fragment ends given the mapping information. That is, in general, the statement "it's impossible to calculate insert size when paired-end reads do not overlap in an RNA library" isn't quite true. Consider a fragment that has a unique mapping in the transcriptome for both ends. In this case, even if the read ends do not overlap, we can use the alignment to infer the length of the original fragment prior to sequencing. It is the difference between the leftmost and rightmost mapped positions of the read ends on the reference sequence where they both map. Now, obviously, in the presence of read-mapping uncertainty (i.e. multimapping reads) this becomes more difficult. However, Salmon overcomes this by using the estimated transcript abundances as computed during the online phase of its inference algorithm to probabilistically update its estimate of the fragment length histogram based on all of the observed mappings of a fragment. Even without this complexity, the fragment length distribution is often relatively smooth and "well-behaved" and can thus be robustly estimated from a relatively small number of samples.
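As a toy illustration of that computation (not Salmon's actual code; the positions and read length below are made up), the implied fragment length for a uniquely mapped pair is just the span from the leftmost to the rightmost mapped base, whether or not the mates overlap:

```r
# Toy example with hypothetical coordinates: two 150 bp mates mapped to the same
# transcript, overlapping by 50 bp (negative inner distance), yet the implied
# fragment length is still well defined.
read_len    <- 150
mate1_start <- 1001                                     # leftmost mapped base of mate 1
mate2_start <- 1101                                     # leftmost mapped base of mate 2
mate2_end   <- mate2_start + read_len - 1               # rightmost mapped base of the pair
frag_len    <- mate2_end - mate1_start + 1              # 250 bp implied fragment
inner_dist  <- mate2_start - (mate1_start + read_len)   # -50 bp: the mates overlap
c(fragment_length = frag_len, inner_distance = inner_dist)
```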

Anyway, regardless of whether or not the fragment ends overlap, Salmon will make use of the mappings of the fragment ends to the indexed transcriptome to infer the implied length of the underlying fragments, and will use these values (appropriately weighted) to estimate the fragment length distribution.


Hi @Genomax and @Rob,

Thanks for the note and my apologies for the delay.

@Genomax, couldn't agree more that the short-insert library (and negative inner distance) is a central concern. We don't anticipate this going away in future wet-lab work, so we are erring on the side of caution with regard to our data assumptions.

@Rob, thanks very much for the detail. It is reassuring to know that, thanks to the transcriptome alignments and paired-end reads, the fragment lengths can still be inferred. We thought about your points some more, along with extra reading on the COMBINE-lab GitHub. Specifically, we found your explanation that --fldMean and --fldSD are used as the prior parameters of a normal distribution, which is then truncated on the left at 0, very helpful (issue 127: https://github.com/COMBINE-lab/salmon/issues/127).

We took a look at our FASTQ files, pre- and post-trimming. On average, the median length is lower than the mean (albeit by approx. 10), and a histogram shows a positive (right) skew (e.g. 1.032, using skewness() from the R "moments" package). Are heavy tails detected/adjusted for by Salmon, causing an update to the prior distribution and possibly moving away from the Gaussian probabilistic model? I hope that made some sense.
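For reference, a minimal R sketch of that check, using a simulated right-skewed length vector in place of our real data:

```r
# Simulated lengths standing in for the real read/fragment lengths (hypothetical data).
library(moments)
set.seed(1)
lens <- rgamma(1e4, shape = 4, scale = 60)   # right-skewed lengths in bp
mean(lens)                                   # mean exceeds the median for a right skew
median(lens)
skewness(lens)                               # positive, roughly 1 for this shape
```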

Would the positive skew have any sort of impact on the TPM output, and would it be preferable not to use TPM for further analysis downstream?

Thanks in advance, Chris

[Attached image: example fragment length distribution]

ATpoint

Note that TPMs from salmon are transcript level, so in any case they would typically not be used downstream, unless you want to analyze transcripts rather than genes. I would simply aggregate the counts to gene level with tximport and go along with that. Normalize with DESeq2 or edgeR, or some transformation such as vst. Use that downstream.
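A minimal sketch of that route (file paths, sample names, conditions, and the tx2gene table are placeholders for your own data):

```r
library(tximport)
library(DESeq2)

samples <- paste0("sample", 1:4)
files   <- file.path("salmon", samples, "quant.sf")          # per-sample Salmon outputs
names(files) <- samples
tx2gene <- read.csv("tx2gene.csv")                           # transcript ID -> gene ID table

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)   # aggregate to gene level

coldata <- data.frame(condition = factor(c("ctrl", "ctrl", "case", "case")),
                      row.names = samples)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
dds <- DESeq(dds)                  # library-size-aware differential expression
vsd <- vst(dds, blind = FALSE)     # variance-stabilized values for downstream use
```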

Rob

Hi chrisk

The non-normality of the final distribution shouldn't be a problem. Salmon only uses a truncated normal as the prior, but the final density is an empirical distribution, so it will conform to the observed fragment lengths. In fact, the MultiQC tool has the ability to read the salmon output formats and will actually draw a nice histogram of the fragment length distribution as inferred by salmon.
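If you want to inspect the inferred distribution directly rather than via MultiQC, a minimal R sketch (the libParams/flenDist.txt location and its one-probability-per-length-bin layout are assumptions based on the usual salmon output layout; check your own quant directory):

```r
# Read and plot Salmon's inferred fragment length distribution.
fld <- scan("quant_dir/libParams/flenDist.txt")   # "quant_dir" is a placeholder path
plot(seq_along(fld) - 1, fld, type = "h",
     xlab = "Fragment length (bp)", ylab = "Probability",
     main = "Fragment length distribution inferred by Salmon")
```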

To the second point about downstream analysis: in general, one wouldn't use the TPM for much quantitative analysis anyway, as TPM is a "within-sample" normalization technique and questions like differential expression will have to account for differences in library size. The tximport package, however, will take care of all of this for you, accounting for the effective transcript lengths (and even the fact that the effective length of the same transcript can change between samples based on the fragment length distribution) and the library sizes of the different samples.
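To make that concrete, a small sketch of what tximport exposes (reusing the placeholder files and tx2gene from the earlier reply): the length matrix is per sample, so the same gene can have a different effective length in each sample, and downstream tools can either use it as an offset or fold it into the counts:

```r
library(tximport)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
head(txi$length)   # genes x samples: effective lengths differ between samples
# DESeqDataSetFromTximport() uses txi$length to build per-gene, per-sample
# normalization factors. Alternatively, build counts that already absorb the
# length correction:
txi_ls <- tximport(files, type = "salmon", tx2gene = tx2gene,
                   countsFromAbundance = "lengthScaledTPM")
```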


Hi @ATpoint and @Rob. Thank you for the added notes on transcript-level TPMs from Salmon, @ATpoint. Yes, we aggregate to the gene level with tximport, but we remain interested in the effect a non-normal distribution would have on Salmon's effective lengths.

@Rob, for context, we are concerned about how these non-normal distributions may be affecting our Salmon TPM output across samples (human plasma, with high variance), given our goal of identifying differentially expressed genes using DESeq2 and then building a binomial classifier with logistic regression.

Based on these issues, we are still deciding which count matrix would go into the classifier: would it be the default matrices from the tximport output with type = "salmon", or the DESeq2-transformed output (vst or rlog) after identifying significant genes?

Also, circling back to your point on the non-normality of the final distribution, @Rob: how non-normal does a distribution have to be for us to flag it before using Salmon? lol

With the mean and SD being the key parameters of a normal distribution, what parameters are used when the empirical distribution turns out to be non-normal? Any references on how to sample from different distributions would be great. Thanks in advance.

Fyi, we utilise the nf-core/rnaseq pipeline.

Rob

Hi chrisk,

So, to further clarify my explanation, Salmon makes no assumption about the normality of the final distribution. Rather, it fits a smoothed, empirical distribution of the observed fragment lengths. The resulting distribution need not be normal, and could even be e.g. bimodal, and Salmon will fit it appropriately. The only normality that arises in this process is that a truncated normal is the (weakly weighted) prior distribution. Once you’ve observed a few thousand mapped fragments or more, it will have essentially no effect on the output.
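As a toy illustration of that last point (not Salmon's code; the prior weight and the simulated fragment lengths below are made up), mixing a small number of truncated-normal pseudo-counts with an observed, skewed fragment-length histogram shows how quickly the empirical shape dominates:

```r
max_len      <- 1000
prior_mean   <- 250                              # in the spirit of --fldMean
prior_sd     <- 25                               # in the spirit of --fldSD
prior        <- dnorm(1:max_len, prior_mean, prior_sd)
prior        <- prior / sum(prior)               # normal prior, truncated to lengths 1..max_len
prior_weight <- 100                              # assumed "weak" pseudo-count mass

set.seed(1)
obs <- round(rgamma(5e4, shape = 4, scale = 60))                # simulated skewed fragment lengths
emp <- tabulate(pmin(pmax(obs, 1), max_len), nbins = max_len)   # empirical histogram

post <- (prior_weight * prior + emp) / (prior_weight + sum(emp))
# With ~50k observed fragments, `post` is essentially the empirical distribution, skew and all.
```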

On the other hand, _if_ you observe a very strange output distribution (e.g. something multimodal), then this may be evidence of something strange or unexpected going on in your sample. There is substantial variation in the exact shape and skew of distributions in observed "successful" experiments, but if the observed fragment distribution is hugely out of whack with the theoretical expectation, then it _may_ be a sign that something was awry in the sample collection or processing up until this point. Unfortunately, I don't know of a particular tool to check when the difference is "big enough" to raise concerns, so I would just look at e.g. the MultiQC reports after processing the data with Salmon. Of course, again, while looking at the empirical fragment length distributions produced by Salmon, as visualized via MultiQC, may help to diagnose any specifically problematic samples, the fundamental issue will have been something related to the sample upstream, rather than any processing done with Salmon itself.
