5.1 years ago
Eric Lim
Our routine QC procedures include using FastQC to estimate duplicate reads. Some recently added datasets caught my attention: a subset of these samples have wildly different estimated duplication levels between the two ends. What could be the issue here?
A related post: High level of duplicate in one reads of paired-end data
Sample | R1 duplicates | R2 duplicates |
---|---|---|
1 | 73.30% | 38.50% |
2 | 72.50% | 42.80% |
3 | 72.40% | 40.00% |
4 | 71.90% | 40.60% |
5 | 71.90% | 39.50% |
Is there anything special about these samples from the wetlab part? Which kit was used and what species is that?
Are these from patterned flowcells (HiSeq 4000/NovaSeq)?
They're from a relatively older dataset sequenced on a HiSeq 2000. While patterned flowcells tend to generate more dupes, I'm not sure what is so special about R2?
Added: We looked into other FASTQ-level parameters (GC content, adapters, overrepresented sequences, etc.) and post-alignment metrics (mapping, skewness, insert sizes, etc.); everything else seems normal compared to the rest of the samples in the experiment.
Are read numbers identical for R1/R2? There is an off chance that some reads in R2 were filtered out by some processing step.
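One way to run this check is to count records in each file and compare. The snippet below is a minimal sketch, assuming standard four-line FASTQ records (no wrapped sequence lines); the file names `sample_R1.fastq.gz` and `sample_R2.fastq.gz` are hypothetical.

```python
import gzip


def count_records(lines):
    """Count FASTQ records in an iterable of lines (assumes 4 lines per record)."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4


def count_fastq_records(path):
    """Count records in a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_records(handle)


def mates_match(r1_path, r2_path):
    """Return True when both mate files hold the same number of records."""
    return count_fastq_records(r1_path) == count_fastq_records(r2_path)


# Hypothetical usage:
# print(mates_match("sample_R1.fastq.gz", "sample_R2.fastq.gz"))
```

Equal counts only rule out dropped reads; they do not show that the mates are still in the same order.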
Is there a quality drop in R2? I've seen large differences in duplicate estimates when R2 is of much lower quality than R1. The lower duplicate estimate is just an artifact caused by sequencing errors: reads that come from the same fragment no longer match exactly.
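The effect can be illustrated with a toy simulation. This is a minimal sketch, not FastQC's actual algorithm: it assumes duplicates are detected by exact sequence match, and the template count, read length and error rate are made-up values chosen only for illustration.

```python
import random

BASES = "ACGT"


def make_templates(n, length, rng):
    """Generate n distinct random sequences of the given length."""
    seen = set()
    while len(seen) < n:
        seen.add("".join(rng.choice(BASES) for _ in range(length)))
    return sorted(seen)


def add_errors(read, error_rate, rng):
    """Substitute each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def duplication_rate(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


if __name__ == "__main__":
    rng = random.Random(0)
    templates = make_templates(200, 50, rng)
    # Every template is sequenced 5 times, so 4 of every 5 reads are duplicates.
    clean = [t for t in templates for _ in range(5)]
    # Same library, but with a hypothetical 2% per-base substitution rate.
    noisy = [add_errors(r, 0.02, rng) for r in clean]
    print("error-free reads:", duplication_rate(clean))
    print("reads with errors:", duplication_rate(noisy))
```

With exact matching, a single substitution is enough to make a duplicate look unique, so the noisier mate reports a lower duplication level even though the underlying fragments are identical.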
BBMap has a nice feature which can help in your situation, the mhist parameter:
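A minimal sketch of how such a run might be set up, assuming `bbmap.sh` is on the `PATH`. `mhist=<file>` asks BBMap to write a histogram of match, substitution, deletion and insertion rates by position in the read. All file names below are hypothetical placeholders, and the helper functions are illustrative rather than part of BBMap.

```python
import subprocess


def build_bbmap_cmd(r1, r2, ref, mhist_out="mhist.txt", sam_out="mapped.sam"):
    """Build a bbmap.sh command line that also writes a match histogram."""
    return [
        "bbmap.sh",
        f"in1={r1}",           # forward reads
        f"in2={r2}",           # reverse reads
        f"ref={ref}",          # reference FASTA
        f"out={sam_out}",      # alignments
        f"mhist={mhist_out}",  # per-position match/sub/del/ins rates
    ]


def run_bbmap(cmd):
    """Run the command; raises CalledProcessError if BBMap exits non-zero."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical inputs; this only prints the command instead of running it.
    cmd = build_bbmap_cmd("sample_R1.fastq.gz", "sample_R2.fastq.gz", "genome.fa")
    print(" ".join(cmd))
```

A substitution rate in the resulting histogram that is clearly higher for one mate than the other would support the sequencing-error explanation.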
@ATpoint, @genomax, and @h.mom
The data are from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47966 and the discrepancy in estimated duplicates was found in 6 of the 12 human RNA-seq samples. Given all the parameters I've looked at, my best guesstimate is that these 12 samples were sequenced in 2 batches. Initially, they probably had n=1 for each of the 6 developmental time points, but reviewers probably asked for replicates, so they sequenced again and published n=2 as technical replicates. I assume the RNAs might've been degraded a bit. Although HiSeq 2000 was reported as the platform, the second batch might've been sequenced on a different platform.
@h.mom, the 2nd batch produced noticeably lower quality overall, but the quality difference between R1 and R2 seems insignificant. The design for this experiment is 101bp forward and 99bp reverse. I'll try what you suggested to see if BBMap will shed some light.
@genomax, no. While we do some pre-processing internally before alignment, QC was run before that filtering. I also manually checked the read numbers and they match.
Whatever it is, I've decided to drop these 6 samples for now. Other than to satisfy my own curiosity, figuring out what causes the discrepancy is not a priority at the moment.
Hi @ericlim. Were you able to find out the reason behind this discrepancy? I also have data like this (however, it's selective whole genome amplification). R1 has more duplicates than R2 in FastQC both pre- and post-trimming.
No clue. I actually haven't thought much about this since I wrote the summary. We've substantially improved our QC pipeline. Will ask the team to include these samples back and see if we notice anything new.
Thanks. I shall be waiting for your input. No one has explained this behaviour so far.