Dear community,
Over the past years, I have collected several scRNA-seq datasets that have been sequenced multiple times, i.e. the same final DNA library was sequenced in multiple separate Illumina runs.
One interesting phenomenon that I observe over and over is that a shallow sequencing run is not a random subsample of a deep sequencing run. What I mean is, say, I have two sequencing runs of the same DNA library, one with 10 000 reads/cell, another with 40 000 reads/cell. If I take the 40 000 reads/cell dataset and randomly subsample it to 10 000 reads/cell, it will be qualitatively different from the dataset that was actually sequenced at 10 000 reads/cell. More specifically, shallow sequencing seems to underestimate library complexity, i.e. the dataset that was actually sequenced shallowly will have fewer unique cell-gene-UMI combinations than the subsampled dataset. I guess this can only mean a sequencing bias, i.e. the sequencer is preferentially processing certain DNA fragments. I could identify a slight bias towards shorter fragments (fragment length estimated based on the expected position of the 3' end of a gene), but I'm not certain whether this can explain the entire extent of this phenomenon.
I was unable to find any literature on this topic. All the articles I found that investigate optimal sequencing depth in single-cell datasets used random subsampling of a deeply sequenced library to simulate shallow sequencing. Are you aware of any work that discusses this phenomenon? Ideally, I would like to find a tool that subsamples a deeply sequenced dataset in a way that is aware of this phenomenon, so that the subsampled dataset would resemble a true shallow-coverage dataset.
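To make the comparison concrete, here is a minimal sketch of the uniform random subsampling described above, followed by a count of unique cell-gene-UMI combinations. It assumes the reads have already been summarised as a dictionary mapping `(cell, gene, umi)` to a read count; that input format and the function names `downsample_reads` and `count_unique_molecules` are hypothetical, chosen for illustration, and not taken from any particular pipeline.

```python
import random
from collections import Counter


def downsample_reads(umi_read_counts, target_reads, seed=0):
    """Uniformly subsample reads without replacement.

    umi_read_counts: dict mapping (cell, gene, umi) -> number of reads
                     (hypothetical input format, assumed for this sketch).
    target_reads:    total number of reads to keep.
    Returns a Counter mapping (cell, gene, umi) -> reads kept.
    """
    # Expand to one entry per read so that every read is equally likely
    # to be drawn. This is simple but memory-hungry on real datasets.
    reads = [key for key, n in umi_read_counts.items() for _ in range(n)]
    if target_reads > len(reads):
        raise ValueError("target_reads exceeds the number of available reads")
    rng = random.Random(seed)
    return Counter(rng.sample(reads, target_reads))


def count_unique_molecules(sampled):
    """Number of distinct cell-gene-UMI combinations with at least one read."""
    return len(sampled)


if __name__ == "__main__":
    # Toy data: three molecules with uneven read support.
    deep = {
        ("cell1", "GeneA", "AAAC"): 5,
        ("cell1", "GeneB", "CCGT"): 2,
        ("cell2", "GeneA", "GGTA"): 1,
    }
    subsampled = downsample_reads(deep, target_reads=4, seed=1)
    print("reads kept:", sum(subsampled.values()))
    print("unique molecules:", count_unique_molecules(subsampled))
```

The number returned by `count_unique_molecules` for the subsampled deep run is what I compare against the same count from the run that was actually sequenced shallowly, at matched total reads.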
Can you give us some additional details? How were the libraries stored? How far apart were the sequencing runs? Were they done using the same chemistry and/or sequencer?
10x support told us a while back that shallow sequencing (e.g. a MiSeq nano run with 1M reads) should at best be used only "qualitatively", for checking library quality.
I have multiple different examples: 10X and BD Rhapsody libraries, MiSeq + NovaSeq, MiSeq + NextSeq 2000, NovaSeq SP + NovaSeq S4, NovaSeq SP + NovaSeq X. Storage can be ruled out as a reason, because the deeply sequenced (= better) library was always sequenced later. So storage would have to improve the library quality :-)