Should I process two complete sets of 10x single-cell multiomics sequencing files from one donor together or separately?
3 days ago

Hi All,

I am a little confused about how to process 10x single-cell multiome sequencing files from one donor. For example, this ENCODE experiment (ENCSR889JIE) contains two complete sets of sequencing files from one donor. Taking scRNA-seq as an example (check the "File details" tab), it contains two sets, S31 and S7:

S31:
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L001_R1_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L001_R2_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L002_R1_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L002_R2_001.fastq.gz

S7:
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L001_R1_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L001_R2_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L002_R1_001.fastq.gz
linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L002_R2_001.fastq.gz

The question is whether I should:

  1. Process them separately, generating two outputs (e.g. run Cell Ranger ARC twice, once on the S7 FASTQs and once on the S31 FASTQs), or
  2. Combine (“merge”) them into a single run so that I end up with one set of output for this donor.

I know how to set up libraries.csv when there’s only a single set of FASTQs (e.g., ENCSR000ULP with separate folders for RNA and ATAC). But in this case, if I want to merge S7 and S31, how do I structure my folders and/or modify libraries.csv so that Cell Ranger ARC knows it’s all one sample but from multiple lanes?

Below is my usual command for a single set:

cellranger-arc count --id=${project_name} \
                     --reference=${reference_dir}/refdata-cellranger-arc-GRCh38-2024-A \
                     --libraries=${work_dir}/libraries.csv \
                     --localcores=24 \
                     --localmem=180

And a typical libraries.csv:

fastqs,sample,library_type
/ENCSR000ULP/RNA,linlab2_041122_snRNA-CGCGCACTTA-AGAATACAGG,Gene Expression
/ENCSR000ULP/ATAC,linlab2_041122_scATAC-AATCACTA-CCGAGAAC-GTAGTGCG-TGCTCTGT,Chromatin Accessibility

The sample column holds the part of the FASTQ file name that precedes the S index (e.g. everything before _S31).
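As a small illustration of that naming rule, the sketch below strips the `_S<n>_L<lane>_<read>_001.fastq.gz` suffix from a FASTQ file name to recover the value for the sample column. The function name `sample_prefix` is hypothetical, and the sketch assumes the standard Illumina file naming shown in the listings above.

```shell
# Hypothetical helper: recover the "sample" value for libraries.csv by
# removing the _S<n>_L<lane>_<read>_001.fastq.gz suffix from a FASTQ name.
sample_prefix() {
    basename "$1" | sed -E 's/_S[0-9]+_L[0-9]{3}_[RI][0-9]_001\.fastq\.gz$//'
}

# Both S indices give the same prefix for the RNA files listed above:
sample_prefix linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L001_R1_001.fastq.gz
sample_prefix linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L002_R2_001.fastq.gz
```

Both calls print `linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG`, which is why a single sample value can cover files with different S indices.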

Now back to ENCSR889JIE, which contains two sets of sequencing files (two S indices). If I want to "merge" the two sets and process them together, what should I change?

  1. Do I just put both the S7 and S31 FASTQs into the same RNA (or ATAC) directory, so that libraries.csv has two rows, one pointing to RNA and one to ATAC?
  2. Or do I need two rows (S7 and S31) for RNA and another two rows (S7 and S31) for ATAC in libraries.csv?
  3. Or is there a recommended best practice for multiple S indices from the same donor?

Thank you very much!

10x_multiome ENCODE cellranger_arc • 835 views

This is a little confusing, since the same sample can't appear in two rows of the sample sheet when demultiplexing with bcl-convert (the S* number is derived from the sample's row position in the sample sheet). So I am not sure why there are two S* numbers for one index pair. The only explanation would be that this is a technical sequencing replicate, where the same sample was run on two flowcells.


Hi GenoMax, thank you very much! So, if these are just technical replicates, does that mean I can reasonably "merge" S7 and S31 and then process them together? To be specific, I would put the S7 and S31 RNA sequencing files in the same RNA folder, do the same for the ATAC sequencing files, and build a libraries.csv like the one below, which ignores the S indices:

fastqs,sample,library_type
/ENCSR889JIE/RNA,linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG,Gene Expression
/ENCSR889JIE/ATAC,linlab1_031522_scATAC-ACGCTTGG-CGCTACAT-GAAAGACA-TTTGCGTC,Chromatin Accessibility

And Cell Ranger ARC will then produce only one output?

Or do you mean that this kind of technical replicate suggests something went wrong, and I should ask the experimentalists whether to drop one of the replicates?

Thank you very much!

2 days ago
GenoMax 150k

Files included in this link are technical replicates from lanes 1 through 4 and have the same S21 number: https://www.encodeproject.org/experiments/ENCSR345CVL/

Looking at the example you posted first, those files are from two separate flowcells, which explains the differing S* numbers (see the full file paths below; the run folder name in each path includes the flowcell ID). As the index pair is identical (along with the sample name), these are indeed technical sequencing replicates and can be processed as such.

/oak/stanford/scg/prj_ENCODE/Staging2/220421_A00509_0503_BHTHCYDSX3-linlab2_041122_snRNA/linlab2_041122_snRNA-CGCGGTAGGTCAACATCCTG_S31_L001_R1_001.fastq.gz

/oak/stanford/scg/prj_ENCODE/Staging2/220321_A00509_0476_BH7T7TDRX2-linlab1_031522_snRNA/linlab1_031522_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L002_R1_001.fastq.gz
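If the original staging paths are not available, one way to check this yourself is to read the flowcell ID from the FASTQ read headers. The sketch below is only an illustration: `flowcell_id` is a hypothetical helper, and it assumes the files still carry the standard Illumina header layout (`@<instrument>:<run>:<flowcell>:<lane>:...`), which may not hold if the headers were rewritten on upload.

```shell
# Hypothetical helper: print the flowcell ID from the first read header
# of a gzipped FASTQ. Assumes the standard Illumina header layout
#   @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...
# so the flowcell ID is the third colon-separated field.
flowcell_id() {
    gzip -cd "$1" | head -n 1 | cut -d: -f3
}

# Example usage (file names taken from the question):
# flowcell_id linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S31_L001_R1_001.fastq.gz
# flowcell_id linlab2_041122_snRNA-CGCGGTAGGT-CAACATCCTG_S7_L001_R1_001.fastq.gz
```

Two different IDs for the S31 and S7 files would be consistent with the same library having been sequenced on two flowcells.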

21 hours ago

If they are from the same library they ought to be processed together. If the same cell_barcode-gene-UMI combination shows up in both sets of FASTQs, you want it counted once rather than twice, right?

You could rename the FASTQs, but making symlinks with the desired names is generally safer. So make them both S7, but change the lanes in the S31 files to L003 and L004. Cell Ranger will understand that they are all the same library, and will process them accordingly.
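The symlink approach could be sketched roughly as follows. This is a minimal illustration, not a tested recipe for this dataset: `link_with_new_lanes` and the directory names in the usage comments are hypothetical, and it assumes both sets share the same sample prefix and use lanes L001 and L002.

```shell
# Hypothetical helper: symlink every FASTQ in SRC_DIR into DEST_DIR,
# replacing the S index with NEW_S and adding LANE_OFFSET to the lane number.
# Usage: link_with_new_lanes SRC_DIR DEST_DIR NEW_S LANE_OFFSET
link_with_new_lanes() {
    src_dir=$1; dest_dir=$2; new_s=$3; lane_offset=$4
    mkdir -p "$dest_dir"
    for f in "$src_dir"/*_S[0-9]*_L[0-9][0-9][0-9]_[RI][0-9]_001.fastq.gz; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        name=$(basename "$f")
        # Extract the lane number without leading zeros (avoids octal parsing).
        lane=$(printf '%s\n' "$name" |
            sed -E 's/.*_S[0-9]+_L0*([0-9]+)_[RI][0-9]_001\.fastq\.gz$/\1/')
        new_lane=$(printf 'L%03d' $((lane + lane_offset)))
        new_name=$(printf '%s\n' "$name" |
            sed -E "s/_S[0-9]+_L[0-9]{3}(_[RI][0-9]_001\.fastq\.gz)$/_${new_s}_${new_lane}\1/")
        # Link to the absolute path so the symlink works from any directory.
        ln -s "$(cd "$src_dir" && pwd)/$name" "$dest_dir/$new_name"
    done
}

# Example usage (hypothetical directory layout):
# link_with_new_lanes raw/S7/RNA  /ENCSR889JIE/RNA S7 0   # S7 stays L001, L002
# link_with_new_lanes raw/S31/RNA /ENCSR889JIE/RNA S7 2   # S31 becomes S7, L003, L004
```

The same two calls would be repeated for the ATAC files, after which libraries.csv needs only one RNA row and one ATAC row. The original files are left untouched, so the links can simply be deleted and recreated if something is named wrongly.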
