In the Trinity manual, I read the following:
--max_pct_stdev <int> :maximum pct of mean for stdev of kmer coverage across read (default: 200)
What does this mean in more detail? I found an older discussion:
http://ivory.idyll.org/blog/trinity-in-silico-normalize.html where they explain it as:
Here, the per-read pct_dev is defined as the deviation in k-mer coverage divided by the average k-mer coverage, times 100 (to make it a percent). If the deviation is high, that indicates that the read is likely to contain many errors, since high-coverage reads with low-coverage k-mers shouldn't happen. Trinity sets a cutoff of 100: if the deviation is as big as the average, the read should go away.
So is max_pct_stdev just a cutoff on std of k-mer coverage / avg k-mer coverage * 100?
Does this mean that a high value for this statistic indicates that some k-mers in the read have really low coverage (inflating the std a lot) compared with the average, and that, assuming generally high sequencing coverage, such reads are probably bad? Or where does the "high-coverage read" part come from?
Thank you for your detailed explanation. Have I understood the equation ("std of k-mer coverage / avg k-mer coverage * 100") correctly?
That seems right. Think of it as "what percent of the mean k-mer coverage can the k-mer coverage sd be, at max?"
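For concreteness, here is a minimal sketch of how that per-read statistic could be computed, under the interpretation above. The function name pct_stdev, the variable kmer_coverages, and the choice of population standard deviation are illustrative assumptions, not Trinity's actual code:

import statistics

def pct_stdev(kmer_coverages):
    """Per-read statistic: (stdev of k-mer coverage / mean k-mer coverage) * 100.

    kmer_coverages is the list of coverage counts for every k-mer in one read.
    Whether Trinity uses the population or sample stdev is not verified here.
    """
    mean_cov = statistics.mean(kmer_coverages)
    stdev_cov = statistics.pstdev(kmer_coverages)
    return 100.0 * stdev_cov / mean_cov

# K-mers with similar coverage -> low value, read is kept:
print(pct_stdev([50, 52, 48, 51, 49]))    # ~2.8

# A high-coverage read with a few near-zero-coverage (likely error) k-mers
# -> stdev is large relative to the mean -> high value:
print(pct_stdev([50, 52, 1, 2, 49, 51]))  # ~68

Under this reading, a read is discarded when its per-read value exceeds the --max_pct_stdev threshold (200 by default per the manual; the blog post describes a cutoff of 100).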