Same sample and reference genome, different number of reads
5 months ago
Nicholas ▴ 10

Hey everyone,

I've been having a hard time understanding an issue with a placental data set that I've been running STARsolo (ver. 2.7.9a) on. I am following the same methods as the paper, which aligns the samples to the human reference genome GRCh38 (ver. 2020-A, CellRanger) using STARsolo with the parameter --soloFeatures GeneFull. Even after following these steps, I have not been getting the same reads as the metadata supplied from the paper. I ran STARsolo using the following parameters:

STAR --runThreadN 8 --genomeDir reference --readFilesIn 1.fastq.gz 2.fastq.gz --readFilesCommand gunzip -c --soloCBwhitelist february-2018.txt --soloType CB_UMI_Simple --soloFeatures GeneFull --soloUMIlen 12 --soloCBlen 16 --soloUMIstart 17 --outSAMattributes NH HI nM AS CR UR CB UB GX GN sS sQ sM --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ./output/

The metadata supplied suggests that I should be getting a total of 874,963,983 reads. However, I'm only getting 792,208,037 reads. In addition to this, nothing else from the same sample is shared between my run and the paper's. Is this common when running STARsolo? I think I've kept the parameters pretty straightforward, so I'm not sure what would cause such a large discrepancy in results. If anyone has any ideas, I'd appreciate it. Thanks.

STARsolo Single-cell RNAseq • 501 views

I have not been getting the same reads as the metadata supplied from the paper.

Many (most?) NGS data aligners produce non-deterministic output unless they provide a seed or an option to create deterministic output. Multi-threading can also affect this. Are you using the exact same version of STAR (and command-line options) as in the paper? All of these differences could add up to the results you are observing.
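
It may also be worth ruling out a difference in the input itself. As far as I understand, the "Number of Reads" line in the STARsolo summary reflects the reads going into the run, so a gap there could point to missing FASTQ files (for example, a lane or sequencing run that the paper included). Below is a minimal Python sketch for counting reads in a gzipped FASTQ, assuming standard four-line records; the function name count_fastq_reads is only illustrative.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file.

    Assumes the standard layout of exactly four lines per read
    (header, sequence, '+', qualities) with no wrapped sequences.
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    if n_lines % 4 != 0:
        raise ValueError(
            f"{path}: {n_lines} lines is not a multiple of 4; "
            "file may be truncated or not plain four-line FASTQ"
        )
    return n_lines // 4
```

Calling count_fastq_reads("2.fastq.gz") on each input file and summing over all files for the sample gives a number to compare directly against the paper's reported read total.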


Unfortunately, they don't specify any parameters apart from GeneFull and the reference genome. Every other parameter is an assumption on my part, so I completely understand your point. Given this difference in outputs, would there be a noticeable difference in downstream analysis?


In addition to this, nothing else from the same sample is shared between my run and the paper's.

What do you mean by this? This might be more significant than the ~9% difference in read count.


True! I didn't want to include it all in the original post, but all of the results differ somewhat:

Mine:

Number of Reads,792208037
Reads With Valid Barcodes,0.981446
Sequencing Saturation,0.687174
Q30 Bases in CB+UMI,0.952513
Q30 Bases in RNA read,0.940124
Reads Mapped to Genome: Unique+Multiple,0.969983
Reads Mapped to Genome: Unique,0.888205
Reads Mapped to GeneFull: Unique+Multipe GeneFull,0.784689
Reads Mapped to GeneFull: Unique GeneFull,0.734297
Estimated Number of Cells,12235
Unique Reads in Cells Mapped to GeneFull,431842812
Fraction of Unique Reads in Cells,0.742361
Mean Reads per Cell,35295
Median Reads per Cell,28407
UMIs in Cells,133461905
Mean UMI per Cell,10908
Median UMI per Cell,8851
Mean GeneFull per Cell,3414
Median GeneFull per Cell,3171
Total GeneFull Detected,30842

Paper:

Number of Reads,874963983   
Reads With Valid Barcodes,0.982058  
Sequencing Saturation,0.633419  
Q30 Bases in CB+UMI,0.956098    
Q30 Bases in RNA read,0.942693  
Reads Mapped to Genome: Unique+Multiple,0.978124    
Reads Mapped to Genome: Unique,0.89299  
Reads Mapped to GeneFull: Unique+Multipe GeneFull,0.810419  
Reads Mapped to GeneFull: Unique GeneFull,0.759404  
Estimated Number of Cells,14284 
Unique Reads in Cells Mapped to GeneFull,516677847  
Fraction of Unique Reads in Cells,0.777601  
Mean Reads per Cell,36171   
Median Reads per Cell,25328 
UMIs in Cells,188372051 
Mean UMI per Cell,13187 
Median UMI per Cell,9369    
Mean GeneFull per Cell,3163 
Median GeneFull per Cell,3129   
Total GeneFull Detected,31095

The difference in the number of reads probably accounts for the rest... just interesting. From the above response, I'm assuming it's okay to move forward, as getting an exact match would be difficult without the exact parameters!
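
To see how far each metric actually moves, a quick sketch along these lines can compute the relative difference between the two summaries. It uses a handful of the values posted above; the helper names parse_summary and relative_change are made up for illustration and are not part of STARsolo.

```python
import csv
import io

# A subset of the two summaries posted above ("metric,value" per line).
MINE = """\
Number of Reads,792208037
Sequencing Saturation,0.687174
Reads Mapped to Genome: Unique,0.888205
Estimated Number of Cells,12235
UMIs in Cells,133461905
Median GeneFull per Cell,3171
"""

PAPER = """\
Number of Reads,874963983
Sequencing Saturation,0.633419
Reads Mapped to Genome: Unique,0.89299
Estimated Number of Cells,14284
UMIs in Cells,188372051
Median GeneFull per Cell,3129
"""


def parse_summary(text):
    """Parse 'metric,value' lines into a dict of floats, skipping odd rows."""
    metrics = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 2:
            continue
        metrics[row[0].strip()] = float(row[1].strip())
    return metrics


def relative_change(mine, paper):
    """Return (mine - paper) / paper for every metric present in both."""
    return {
        name: (mine[name] - paper[name]) / paper[name]
        for name in paper
        if name in mine and paper[name] != 0
    }


changes = relative_change(parse_summary(MINE), parse_summary(PAPER))
for name, change in changes.items():
    print(f"{name:35s} {change:+.1%}")
```

Count-based metrics (reads, cells, UMIs) would be expected to shift with the input read total, while per-read fractions such as the unique mapping rate should stay close if the reference and parameters match.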


I see, yeah, it all looks related to the number of reads. I thought you might have meant downstream analysis wasn't replicating either!

I agree replicating exact output may be difficult because it can depend on details, but the main conclusions should be consistent. If they aren't, then that's definitely the time to start asking questions.
