Hello all,
I'm posting this because i've tried (almost) everything possible but haven't obtained reasonable results.
For the past couple of months, I've been trying to reproduce the results from a single-cell RNA-seq study by starting from the raw count matrices reported by the authors and performing all QC and downstream steps like normalization, clustering, and differential expression analysis. This works great and reproduces all results almost exactly.
However, when i start from the fastq files, quantify the read counts, and follow the exact same pipeline and methods used to reproduce the results initially, this gives widely different results. For example, some genes that are reported as differentially expressed initially are not in the new analysis.
I've looked at each step of the analysis pipeline in extreme detail, but haven't found anything different from what the authors report. I do not believe this would be an issue with the pipeline since there's a very slim change of obtaining the exact rank of differentially expressed genes with a faulty pipeline.
I thought this was a result of the version of cellranger used for quantification, but after using v3, v2 (what the authors used), and v7, i still can't reproduce the results. The only difference in the quantification step is the genome build. The authors used GRCh38.p5, while i've used one downloaded from 10x genomics (a 2020 build). I doubt this could cause such difference, and am truly at my wits end.
Can anyone help with possible solutions? or things they would consider.
TL:DR
I can reproduce authors work when i start with their count matrix, but not when i start with raw fastq, using the exact same pipeline. Any possible reason for this?
I sincerely appreciate it!
If the end result matches up when you start with their count matrices but not when using yours, how do the count matrices look when you compare them directly? My only advice would be keep trimming away at the complexity of the workflow to narrow it down. You could always gently ask them for more detail on methods... though in my experience you'll get a response like "great question, we'll get back to you!" followed by nothing. (Good on you trying to actually reproduce results though!)
Thanks a lot for your reply, @jesse. This is something I'll definitely consider. At the moment, only ways I can think of to compare the count matrices are average gene counts across groups/cell-types. Are there any other ways/tools to compare counts?
Not really my area, but just thinking out loud... if you're using the same pipeline your counts matrices should be covering the same information, right? Is it cell UMIs on rows and genes on columns or the like? Can you compare how any given cell is annotated in their matrix versus yours? Maybe something like a PCA (or maybe LDA, since the question is, "what separates these two groups") would help to summarize an all-to-all comparison?