Hey everyone,
I've been having trouble understanding an issue with a placental dataset that I've been running STARsolo (ver. 2.7.9a) on. I am following the same methods as the paper, which aligns the samples to the human reference genome GRCh38 (ver. 2020-A, CellRanger) using STARsolo with the parameter --soloFeatures GeneFull. Even after following the paper's methods, I have not been getting the same read counts as the metadata supplied with the paper. I ran STARsolo using the following parameters:
STAR --runThreadN 8 --genomeDir reference --readFilesIn 1.fastq.gz 2.fastq.gz --readFilesCommand gunzip -c --soloCBwhitelist february-2018.txt --soloType CB_UMI_Simple --soloFeatures GeneFull --soloUMIlen 12 --soloCBlen 16 --soloUMIstart 17 --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ./output/
The metadata supplied suggests that I should be getting a total of 874,963,983 reads. However, I'm only getting 792,208,037 reads. In addition to this, nothing else from the same sample is shared between my run and the paper's. Is this common when running STARsolo? I think I've been pretty straightforward with the parameters, so I'm not sure what would cause such a large discrepancy in the results. If anyone has any idea, I'd appreciate it. Thanks.
Many (most?) NGS aligners produce non-deterministic output unless they provide a seed or an option to create deterministic output. Multi-threading can also affect this. Are you using exactly the same version of STAR (and command-line options) as in the paper? All of these differences could add up to the results you are observing.
Unfortunately, they don't specify any parameters apart from using GeneFull and the reference genome. Other than this, every parameter is an assumption, but I completely understand. Regarding the difference in outputs, would there be a noticeable difference in downstream analysis?
What do you mean by this? This might be more significant than the ~9% read count difference.
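For reference, the ~9% figure follows from the two totals quoted in the original post. A quick shell check, as a minimal sketch (the variable names are just illustrative; awk is used because plain shell arithmetic is integer-only):

```shell
# Read totals quoted in the original post.
paper_reads=874963983
my_reads=792208037

# Shortfall of this run relative to the paper, as a percentage.
pct_diff=$(awk -v p="$paper_reads" -v m="$my_reads" \
    'BEGIN { printf "%.2f", (p - m) / p * 100 }')

echo "Shortfall relative to the paper: ${pct_diff}%"
```

This gives roughly 9.5%, so "~9%" is a fair summary.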
True! I didn't want to include it all in the original post, but all the results are somewhat different:
Mine:
Paper:
The difference in read count probably accounts for it... just interesting. From the above response, I'm assuming it's okay to move forward, as getting an exact output would be difficult without the exact parameters!
I see, yeah, it all looks related to the number of reads. I thought you might have meant the downstream analysis wasn't replicating either!
I agree replicating exact output may be difficult because it can depend on details, but the main conclusions should be consistent. If they aren't, then that's definitely the time to start asking questions.
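If the read totals themselves are in question, one quick check is to count the reads in the input FASTQ files and compare that number with the paper's total, which would show whether the gap already exists before alignment. A minimal sketch, assuming gzip-compressed FASTQ with the standard four lines per record; `count_fastq_reads` is a hypothetical helper name, not part of STAR:

```shell
# Hypothetical helper: print the number of reads in a gzipped FASTQ file.
# Assumes the standard 4-lines-per-record FASTQ layout.
count_fastq_reads() {
    local lines
    lines=$(gunzip -c "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Example (file name taken from the command in the original post):
#   count_fastq_reads 1.fastq.gz
```

If the FASTQ count already falls short of the paper's total, the difference would more likely come from the input files (for example, which runs or lanes were downloaded) than from the aligner settings.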