Question

Creating fastq subsets from existing files

0

Entering edit mode

16 months ago

Zack • 0

Hello,

I'm new to this area. Please help me with this. I'd like to generate test fastq files that contain only a few hundred subset of cells from paired-end scRNA seq fastq files downloaded from https://www.10xgenomics.com/resources/datasets/whole-blood-rbc-lysis-for-pbmcs-neutrophils-granulocytes-3-3-1-standard. Here are the steps I have tried:

First I generated a whitelist of barcodes using umi_tools for R1 with the following command:

umi_tools whitelist --stdin WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001.fastq.gz \
--extract-method=string \
--bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
--log2stderr > whitelist.txt

Then, I used a subset of these barcodes to extract the R1_subset.fastq and R2_subset.fastq files using the following commands:

umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \
--stdin WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001.fastq.gz \
--stdout WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001_extracted.fastq.gz \
--read2-in WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R2_001.fastq.gz \
--read2-out=WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R2_001_extracted.fastq.gz \
--whitelist=whitelist1_subset.txt

where whitelist1_subset.txt is given like

AAAGAACTCGGCATCG
AAAGGATTCGCAGATT
AAAGGTATCCAAGCAT
AACAAAGAGAACAAGG
AACAAGATCGAATCCA

However, when I tried using cell ranger on the extracted fastq files, I encountered the following error message:

Log message:
The read lengths are incompatible with all the chemistries for Sample WB_Lysis_Granulocytes_3p_Introns_8kCells in "/mnt/ufs18/home-238/suzhe/Data/scrnaseqworkshop/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/subset".
 - read1 median length = 2
 - read2 median length = 90
 - index1 median length = 0

The minimum read length for different chemistries are:
SFRP     - read1: 26, read2: 30, index1: 0
SC5P-R2  - read1: 26, read2: 25, index1: 0
SC5P-PE  - read1: 81, read2: 25, index1: 0
SC3Pv1   - read1: 25, read2: 10, index1: 14
SC3Pv2   - read1: 26, read2: 25, index1: 0
SC3Pv3   - read1: 26, read2: 25, index1: 0
SC3Pv3LT - read1: 26, read2: 25, index1: 0
SC3Pv3HT - read1: 26, read2: 25, index1: 0

We expect that at least 50% of the reads exceed the minimum length.

Is there anyone who can help resolve this issue? The original fastq files work with cell ranger.

Thank you very much!

scRNA-seq umi_tools cellranger fastq • 1.6k views

ADD COMMENT • link 16 months ago by Zack • 0

0

Entering edit mode

Why don't you use seqtk to generate test fastq files for a subset of cells and then run cellranger in it? You can do something like this-

#Download fastq
wget https://cf.10xgenomics.com/samples/cell-exp/6.1.0/WB_Lysis_Granulocytes_3p_Introns_8kCells/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs.tar

#untar the file
tar -xvf WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs.tar

DIR=WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/
TRANSCRIPTOME=cellranger/refdata-gex-GRCh38-2020-A
OUTDIR=10k

r1=WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001.fastq.gz
r2=WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R2_001.fastq.gz

r3=WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L002_R1_001.fastq.gz
r4=WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L002_R2_001.fastq.gz

seqtk sample -s100 ${DIR}/${r1} 10000 > ${DIR}/WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000_S1_L001_R1_001.fastq.gz
seqtk sample -s100 ${DIR}/${r2} 10000 > ${DIR}/WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000_S1_L001_R2_001.fastq.gz

seqtk sample -s100 ${DIR}/${r3} 10000 > ${DIR}/WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000_S1_L002_R1_001.fastq.gz
seqtk sample -s100 ${DIR}/${r4} 10000 > ${DIR}/WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000_S1_L002_R2_001.fastq.gz

cellranger count \
--fastqs=${DIR} \
--id=WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000 \
--transcriptome=${TRANSCRIPTOME} \
--sample=WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-10000

For the reference check here- https://bioinf.shenwei.me/seqkit/usage/#sample

ADD REPLY • link 16 months ago by bk11 ★ 3.0k

0

Entering edit mode

Thank you for your reply. Actually I tried seqtk previously and it gave me an error message when using cellranger count on the generated fastq files:

Log message:
FASTQ header mismatch detected at line 4 of input files "/WB_subset_r10000/WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001.fastq.gz" and "/WB_subset_r10000/WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_I1_001.fastq.gz": file: "/WB_subset_r10000/WB_Lysis_Granulocytes_3p_Introns_8kCells_S1_L001_R1_001.fastq.gz", line: 4

Not sure how to fix this. It seems that cellranger count also requires at least 10000 reads, which makes the test experiment slow. What is the reason for that?

In addition, is getting a subset of reads the same as getting a subset of cells (barcodes)?

ADD REPLY • link 16 months ago by Zack • 0

0

Entering edit mode

Here is the log message from cellranger when your reads is less than 10000-

Log message:
There were not enough reads to auto detect the chemistry: Sample WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-500 in "WB_Lysis_Granulocytes_3p_Introns_8kCells_sub-500"
Note that you can avoid auto detection by specifying the specific chemistry type and version.
- Minimum number of required reads = 10000
- Number of reads available = 1000

ADD REPLY • link 16 months ago by bk11 ★ 3.0k

0

Entering edit mode

Yes, this is the log message if the reads is less than 10000. Do you know why cell ranger requires at least 10000 reads?

ADD REPLY • link 16 months ago by Zack • 0

0

Entering edit mode

As stated in the log file- if you use less than 10000 reads, there will not be enough reads to detect the specific chemistry type and version.

ADD REPLY • link 16 months ago by bk11 ★ 3.0k

0

Entering edit mode

I am just confused why 10000 reads provide sufficient information for the chemistry type and version but 9999 reads do not. What is the reason to set it to 10000?