Question

Normalization of single cell RNA-seq data with different read depths

3

7.0 years ago

mariah ▴ 30

I have a single-cell RNA-seq dataset based on three unique biological samples. The number of reads per cell varies widely between these samples (a 30x difference). My current understanding is that the best way to normalize this data so that the samples can be compared is to subsample the reads from the samples with more reads per cell until their depth matches that of the sample with the fewest reads per cell. This approach loses a lot of data.
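For readers unfamiliar with the subsampling idea described above, here is a minimal Python sketch. It assumes each cell is a plain list of per-gene counts; the function name `downsample_counts` and the example numbers are made up for illustration and do not come from any particular package.

```python
import random

def downsample_counts(counts, target_depth, seed=0):
    """Randomly subsample one cell's counts (without replacement) to target_depth reads."""
    total = sum(counts)
    if total <= target_depth:
        return list(counts)  # already at or below the target depth: keep unchanged
    rng = random.Random(seed)
    # Expand to one entry per read, labelled by gene index, then sample reads.
    reads = [gene for gene, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, target_depth)
    out = [0] * len(counts)
    for gene in kept:
        out[gene] += 1
    return out

deep_cell = [300, 0, 45, 5, 150]  # hypothetical cell with 500 reads
sub = downsample_counts(deep_cell, target_depth=50)
print(sub, sum(sub))  # the counts now sum to 50
```

Lowly expressed genes (such as the one with 5 reads here) will often drop to zero after downsampling, which is the data loss the question refers to.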

I could also apply a normalization factor to account for sequencing depth, but I am concerned that, because the reads per cell differ so much, the drop-out events in the sample with fewer reads per cell would bias the comparison.
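The concern about drop-outs can be seen in a small Python sketch of a simple depth normalization (scaling each cell to a fixed total and taking log(1 + x)). The scale of 10,000 and the toy counts are assumptions chosen only for illustration; this is one common kind of normalization factor, not necessarily the one the poster has in mind.

```python
import math

def log_normalize(counts, scale=10_000):
    """Scale one cell's counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / total * scale) for c in counts]

deep = [300, 3, 45, 2, 150]   # hypothetical deeply sequenced cell (500 reads)
shallow = [10, 0, 2, 0, 5]    # hypothetical shallow cell (17 reads)

deep_norm = log_normalize(deep)
shallow_norm = log_normalize(shallow)

# Genes 1 and 3 are detected in the deep cell but stay at exactly zero in the
# shallow cell: rescaling cannot recover counts that were never observed.
print(deep_norm[1], shallow_norm[1])
```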

Is there an option where my comparison would not be biased or lose vast quantities of data?

This normalization is in preparation for clustering and DEG analysis.

Thank you for your input!

RNA-Seq scRNA-Seq • 5.7k views

ADD COMMENT • link updated 6.2 years ago by wt215 • 0 • written 7.0 years ago by mariah ▴ 30

0

Were your biological samples each sequenced separately, i.e., is the library size effect confounded with the different conditions you wish to compare?

I'm not a big fan of subsampling, precisely because of the data loss, but you will have to filter quite stringently nevertheless. E.g., you could only focus on genes that are covered in 80% of all cells (across all conditions) or the like.
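A filter of this kind could look roughly like the following Python sketch. It assumes a genes-by-cells matrix stored as a list of rows and treats any non-zero count as "covered"; the function name and the 80% default are illustrative only.

```python
def genes_detected_in_fraction(matrix, min_fraction=0.8):
    """Return indices of genes with a non-zero count in at least min_fraction of cells.

    matrix: list of rows, one row per gene, one column per cell (all conditions pooled).
    """
    keep = []
    for gene_idx, row in enumerate(matrix):
        detected = sum(1 for c in row if c > 0)
        if row and detected / len(row) >= min_fraction:
            keep.append(gene_idx)
    return keep

counts = [
    [5, 3, 0, 8, 2],  # gene 0: detected in 4 of 5 cells
    [0, 0, 1, 0, 0],  # gene 1: detected in 1 of 5 cells
    [9, 7, 6, 4, 3],  # gene 2: detected in 5 of 5 cells
]
kept = genes_detected_in_fraction(counts)
print(kept)  # [0, 2]
```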

As far as normalizations go -- the paper by Vallejos et al. has a very good overview of the different problems and possible solutions.

ADD REPLY • link 6.8 years ago by Friederike 9.0k

0

Hi,

I faced the same problem; please have a look at the pictures. This is a PCA of rlog-normalised single-cell seq data (I randomly selected 70 cells per time point, pooled the reads and obtained the mean expression, in two replicates):

https://ibb.co/fRWbkd
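The pooling step described above might be sketched in Python as follows. This is only a guess at the procedure: it assumes the cells of one time point are lists of per-gene counts, that 70 cells are drawn at random without replacement, and that their counts are summed per gene. The helper name `pseudobulk` and the simulated counts are hypothetical.

```python
import random

def pseudobulk(cells, n_cells, seed):
    """Sum per-gene counts over a random subset of n_cells cells."""
    rng = random.Random(seed)
    chosen = rng.sample(cells, n_cells)  # sample cells without replacement
    return [sum(col) for col in zip(*chosen)]

# Simulated stand-in for one time point: 200 cells x 4 genes.
sim = random.Random(1)
cells = [[sim.randint(0, 5) for _ in range(4)] for _ in range(200)]

rep1 = pseudobulk(cells, n_cells=70, seed=10)  # first pooled "replicate"
rep2 = pseudobulk(cells, n_cells=70, seed=11)  # second pooled "replicate"
print(rep1, rep2)
```

Note that two pools drawn from the same set of cells are not independent biological replicates, which is worth keeping in mind when comparing them with bulk replicates.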

This is the PCA of rlog-normalised bulk RNA-seq data with the same time points and experimental design:

https://ibb.co/eHgnrJ

This is the PCA of a merged matrix combining the pooled single-cell data matrix and the bulk RNA-seq data matrix:

https://ibb.co/h2GRkd

As you can see, when I merged the matrix of pooled single-cell seq data with the bulk RNA-seq matrix, the two data types look completely different, even though they come from the same organism, time points and experimental design. Presumably the single-cell seq data are not correct; otherwise the merged matrix should show a developmental trend like the one each PCA shows separately. Could you please give me some advice on how I could process the single-cell seq data to prove that my data are correct?

ADD REPLY • link 6.6 years ago by Za ▴ 140

1

I have no idea what you're asking.

There is no way to actually "prove" that your data is correct. Your data is what it is.

Are you trying to say that when you treat your scRNA-seq data as bulk RNA-seq data you see different patterns? I'm absolutely not surprised that a PCA will separate samples based on the type of experiment, you'd probably see that for two bulk RNA-seq experiments done with different library preparations. As long as you are not trying to do differential gene expression on bulk vs. sc, I don't see how that would matter.

A more meaningful approach might be to take the 50 (or 100 or 200) most strongly expressed genes as defined by the bulk RNA-seq and see what their expression ranks are in the scRNA-seq. This should give you a good impression of how many strongly expressed transcripts you were able to capture (and how consistently you were able to capture them).
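As a rough Python sketch of that comparison, assuming both data sets have been reduced to one expression value per gene (for example, the mean across samples or cells) stored in dictionaries; the gene names, values and function names below are invented for illustration.

```python
def rank_genes(expr):
    """Return {gene: rank}, where rank 1 is the most strongly expressed gene."""
    ordered = sorted(expr, key=lambda g: expr[g], reverse=True)
    return {g: i + 1 for i, g in enumerate(ordered)}

def sc_ranks_of_top_bulk_genes(bulk, sc, n_top):
    """For the n_top most strongly expressed bulk genes, look up their rank in the sc data."""
    bulk_ranks = rank_genes(bulk)
    sc_ranks = rank_genes(sc)
    top = sorted(bulk_ranks, key=bulk_ranks.get)[:n_top]
    # Genes absent from the sc data get None instead of a rank.
    return [(g, sc_ranks.get(g)) for g in top]

bulk = {"geneA": 900.0, "geneB": 500.0, "geneC": 20.0, "geneD": 5.0}
sc_mean = {"geneA": 12.0, "geneB": 0.1, "geneC": 3.0, "geneD": 0.0}

result = sc_ranks_of_top_bulk_genes(bulk, sc_mean, n_top=2)
print(result)  # [('geneA', 1), ('geneB', 3)]
```

If most of the top bulk genes also rank highly in the scRNA-seq data, the strongly expressed transcripts were captured reasonably well; large rank drops (like `geneB` here) point to transcripts that were captured poorly or inconsistently.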

ADD REPLY • link 6.6 years ago by Friederike 9.0k

0

Thanks a lot. I was trying to make the PCA of a merged matrix of bulk and pooled sc data (14000 genes and 32 samples) look as good as the PCAs I had made from each data set independently, so I must find a way to normalise my data so that the merged matrix of bulk and sc gives a good PCA.

ADD REPLY • link 6.6 years ago by Za ▴ 140

1

Sure, you can try that, though I wouldn't get my hopes up too high.

ADD REPLY • link 6.6 years ago by Friederike 9.0k

0

You are right, because I am just getting nonsense results.

ADD REPLY • link 6.6 years ago by Za ▴ 140

1

I don't think this should have you very worried. It just means that there's a lot of variation introduced by the different types of experiments, which is really not a big surprise. You may have more luck looking at subsequent principal components, but again, I don't think that a PCA on the merged data set is the most effective means of addressing the question at hand.

If I understand correctly, you basically want to make sure that you're drawing similar conclusions/seeing similar patterns in both bulk and scRNA-seq? Then do similar analyses and you should get similar genes popping up.

ADD REPLY • link 6.6 years ago by Friederike 9.0k

0

Thank you very much once again. I think so, YES. As the PCAs show, in both the bulk data and the pooled sc data (I randomly pooled 70 cells twice to make a matrix similar to my bulk data), we see a sensible trend along development (time points 2, 4, 6, 8, 10, 12, 14, 16). I am just wondering why, when I merge the two matrices (bulk + pooled sc), the PCA does not show any developmental trend. Maybe I must remove problematic genes before merging the matrices, or apply a normalisation method beforehand.

In any case, thanks a lot for considering my problem.

ADD REPLY • link 6.6 years ago by Za ▴ 140

score 0 · Answer 1 · 2018-10-04

Hi mariah,

To deal with different sequencing depths, you could try SCnorm, bayNorm or SAVER. The first two have been shown to be good at correcting for differences in sequencing depth. The benchmark is the number of DE genes called: both methods resulted in very low false positive rates.

Cheers, Wenhao