Question

Handling multiple fastq files per sample, per lane, per read out of the Cellranger bam to fastq workflow

0

20 months ago

Erfanesi • 0

For a set of BAM files downloaded from PRJNA625920 in SRA, I used 10x Genomics' bamtofastq tool, but got 25 FASTQ files per sample, per lane, per read, like this (the same goes for R1 and R2):

Donor10OC.bam_S1_L001_I1_001.fastq.gz
Donor10OC.bam_S1_L001_I1_001.fastq_2.gz
Donor10OC.bam_S1_L001_I1_001.fastq_3.gz
..
..
Donor10OC.bam_S1_L001_I1_001.fastq_24.gz
Donor10OC.bam_S1_L001_I1_001.fastq_25.gz

I assume that these are technical replicates, as they represent the same sample (S1) and the same lane (L001).

Is my assumption correct? If so, is simply concatenating the files enough to merge them, or should something else be done? If not, how should I handle this situation?

Your expert advice is highly appreciated.

bamtofastq cellranger RNA-seq • 2.2k views

ADD COMMENT • link 20 months ago by Erfanesi • 0

1

It looks like bamtofastq defaults to

--reads-per-fastq=N Number of reads per FASTQ chunk. Default: 50000000

Have you checked to see how many reads there are in each file? 25 is a large number of files if you did not change the default above.

ADD REPLY • link 20 months ago by GenoMax 152k

0

Thanks for your quick reply GenoMax! I checked the read counts for the largest and smallest files using the code shared here and got the following numbers:

Largest file: 13121784
Smallest file: 306550
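
For anyone wanting to reproduce counts like these, here is a minimal Python sketch. It is not the code linked above; `count_fastq_reads` is a hypothetical helper name, and it assumes each file is gzip-compressed with the standard four lines per FASTQ record.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzip-compressed FASTQ file.

    Assumes the usual 4-line FASTQ layout (header, sequence, '+', qualities).
    """
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Calling `count_fastq_reads("Donor10OC.bam_S1_L001_I1_001.fastq_2.gz")` would return the number of reads in that chunk; the odd `fastq_2.gz` ending does not matter because the file is opened by path, not by extension.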

ADD REPLY • link 20 months ago by Erfanesi • 0

1

As long as the files came from one BAM that you know belongs to one sample, it should be fine to cat the files together. Are you planning to run cellranger? It may understand the file pieces, so you may not need to do anything.
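
A rough Python equivalent of the cat approach, as a sketch only: gzip files can be joined byte for byte because a gzip stream may contain several members, so no decompression is needed. `concat_gzip` is a hypothetical helper name, and the caller is assumed to pass the pieces in the same order for I1, R1 and R2 so that reads stay paired.

```python
import shutil


def concat_gzip(parts, merged_path):
    """Append gzip files to one another without decompressing them.

    `parts` is an ordered list of input paths; the order must be identical
    for the I1, R1 and R2 sets so that paired reads stay in sync.
    """
    with open(merged_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

The merged file still needs an Illumina-style name before cellranger will pick it up.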

ADD REPLY • link 20 months ago by GenoMax 152k

0

20 months ago

swbarnes2 15k

The names you have now will not work; a name ending in fastq_2.gz will not be accepted by cellranger. Cellranger is very picky about the FASTQ names looking exactly as if they came off the Illumina instrument. Merging into one giant FASTQ will work fine, as long as it is named properly.

The other option is to alter the names so that they look like they came from the same sample, but different lanes (cellranger won't care that there is no instrument with 25 lanes). 10X will understand that they should all be processed together. You can also do this by making symlinks with correct names.
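
The renaming idea could be sketched in Python roughly as follows. `lane_style_name` is a hypothetical helper, and it assumes every piece follows the pattern shown in the question (`..._S1_L001_I1_001.fastq.gz`, `..._001.fastq_2.gz`, and so on) and that all pieces originally carry the same lane, so the chunk number can safely stand in for the lane number.

```python
import re

# Matches e.g. Donor10OC.bam_S1_L001_I1_001.fastq_2.gz; the _N chunk suffix is optional.
_CHUNK_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<snum>\d+)_L\d{3}_(?P<read>[IR][12])_001"
    r"\.fastq(?:_(?P<chunk>\d+))?\.gz$"
)


def lane_style_name(filename):
    """Return an Illumina-style name that uses the chunk number as the lane."""
    match = _CHUNK_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected FASTQ name: {filename}")
    # A file without a numeric suffix is treated as chunk 1.
    lane = int(match.group("chunk") or 1)
    return "{}_S{}_L{:03d}_{}_001.fastq.gz".format(
        match.group("sample"), match.group("snum"), lane, match.group("read")
    )
```

The new names can then be applied without copying data, for example with `os.symlink(original_path, os.path.join(link_dir, lane_style_name(name)))`, where `link_dir` is a directory of your choosing.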

ADD COMMENT • link 20 months ago by swbarnes2 15k

1

Curious that their own conversion utility made the files in this format. Not a great way of handling the data if it now requires additional changes.

ADD REPLY • link 20 months ago by GenoMax 152k

score 1 · Accepted Answer · 2023-11-16

After some investigation, I realized that this problem occurs when bamtofastq writes more than one output directory for a BAM file. Every directory produced by bamtofastq relates to indices in the BAM file, as described on the 10x website: https://kb.10xgenomics.com/hc/en-us/articles/360058600992-How-do-I-find-out-which-FASTQ-files-belong-to-which-library-in-10x-Genomics-bamtofastq-output-folders-. Since the FASTQ files in every directory are named the same, collecting all of them in one directory resulted in multiple numbered suffixes, one for each directory created by Cell Ranger bamtofastq.
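
One way to avoid the name collisions altogether is to leave the files in their original subdirectories and hand all of those directories to cellranger. The Python sketch below only builds the directory list; `fastq_dirs` is a hypothetical helper, the layout (one level of subdirectories under the bamtofastq output folder) is an assumption, and you should check the Cell Ranger documentation for whether `--fastqs` accepts a comma-separated list in your version.

```python
from pathlib import Path


def fastq_dirs(root):
    """List the subdirectories of `root` that contain *.fastq.gz files, sorted."""
    root = Path(root)
    return sorted(
        str(d) for d in root.iterdir()
        if d.is_dir() and any(d.glob("*.fastq.gz"))
    )
```

`",".join(fastq_dirs("bamtofastq_out"))` then gives a single string of directories, where `bamtofastq_out` is a placeholder for your own output folder.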