In the Trinity manual, I read the following:
--max_pct_stdev <int> :maximum pct of mean for stdev of kmer coverage across read (default: 200)
What does this mean in more detail? I found an older discussion:
http://ivory.idyll.org/blog/trinity-in-silico-normalize.html where they explain it as:
Here, the per-read pct_dev is defined as the deviation in k-mer coverage divided by the average k-mer coverage, times 100 (to make it a percent). If the deviation is high, that indicates that the read is likely to contain many errors, since high-coverage reads with low-coverage k-mers shouldn't happen. Trinity sets a cutoff of 100: if the deviation is as big as the average, the read should go away.
So is max_pct_stdev just a cutoff on std of k-mer coverage / avg k-mer coverage * 100?
Does this mean that a high value for this statistic indicates that some k-mers in the read have really low coverage (inflating the std a lot) compared with the average, and that, assuming generally high sequencing coverage, such reads are probably bad? Or where does the "high-coverage read" part come from?
Thank you for your detailed explanation. Have I understood the equation ("std of k-mer coverage / avg k-mer coverage * 100") correctly?
That seems right. Think of it as "what percent of the mean k-mer coverage can the k-mer coverage sd be, at max?"
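For concreteness, here is a minimal sketch of how that per-read statistic could be computed, under the interpretation above. The function name pct_stdev, the variable kmer_coverages, and the choice of population standard deviation are illustrative assumptions, not Trinity's actual code:

import statistics

def pct_stdev(kmer_coverages):
    """Per-read statistic: (stdev of k-mer coverage / mean k-mer coverage) * 100.

    kmer_coverages is the list of coverage counts for every k-mer in one read.
    Whether Trinity uses the population or sample stdev is not verified here.
    """
    mean_cov = statistics.mean(kmer_coverages)
    stdev_cov = statistics.pstdev(kmer_coverages)
    return 100.0 * stdev_cov / mean_cov

# K-mers with similar coverage -> low value, read is kept:
print(pct_stdev([50, 52, 48, 51, 49]))    # ~2.8

# A high-coverage read with a few near-zero-coverage (likely error) k-mers
# -> stdev is large relative to the mean -> high value:
print(pct_stdev([50, 52, 1, 2, 49, 51]))  # ~68

Under this reading, a read is discarded when its per-read value exceeds the --max_pct_stdev threshold (200 by default per the manual; the blog post describes a cutoff of 100).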