I have over 800,000 fastq.gz files after demultiplexing and am trying to combine them based on barcodes (BCs) and basenames. Below is an example of my data. Each file has a basename (sample#) and a BC1 identifier (BC1_#):
sample1_BC1_1_R1.fastq.gz
sample1_BC1_49_R1.fastq.gz
sample1_BC1_2_R1.fastq.gz
sample1_BC1_50_R1.fastq.gz
sample2_BC1_1_R1.fastq.gz
sample2_BC1_49_R1.fastq.gz
sample2_BC1_2_R1.fastq.gz
sample2_BC1_50_R1.fastq.gz
I want to combine files that share a basename and whose BC1 identifiers belong to the same pair, because each sample received two different BC1s. The pairs to combine are:
BC1_1 and BC1_49
BC1_2 and BC1_50
BC1_3 and BC1_51
...
BC1_48 and BC1_96
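The pairing listed above follows a fixed rule: identifier n (1 to 48) goes with identifier n + 48. A minimal Python sketch of that rule is shown here; the function name `pair_label` and the `offset` parameter are made up for illustration and are not part of any existing tool.

```python
def pair_label(bc: int, offset: int = 48) -> str:
    """Return the merged label for a BC1 number, e.g. 1 or 49 -> '1-49'.

    Assumes BC1 numbers run from 1 to 2 * offset and that n pairs with n + offset.
    """
    if not 1 <= bc <= 2 * offset:
        raise ValueError(f"BC1 number out of range: {bc}")
    # Map the upper half (49..96) back onto the lower half (1..48).
    low = bc if bc <= offset else bc - offset
    return f"{low}-{low + offset}"


for bc in (1, 49, 2, 50, 48, 96):
    print(bc, "->", pair_label(bc))
```

Both members of a pair give the same label, so the label can be used as a grouping key.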
For the example above with 8 files, my output would be 4 files...
sample1_BC1_1-49_R1.fastq.gz
sample1_BC1_2-50_R1.fastq.gz
sample2_BC1_1-49_R1.fastq.gz
sample2_BC1_2-50_R1.fastq.gz
How can I do this in Linux or Python? Or even R? Thank you in advance! I haven't quite reached high proficiency with Linux or Python yet, so any help is welcome.
I have tried looping through files to identify files with similar basenames but am having trouble concatenating the files given they have the right BC1 identifiers.
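One possible approach, as a minimal sketch rather than a tested pipeline: parse each filename, group files by basename and BC1 pair, then append the gzip files byte for byte. Concatenated gzip files are themselves valid gzip files, so nothing needs to be decompressed. The sketch assumes filenames look exactly like `sample1_BC1_49_R1.fastq.gz`, that BC1 number n pairs with n + 48, and that an R2 file (if any) follows the same naming. The names `group_files`, `merge_pairs`, and the directory arguments are hypothetical.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed filename layout: <sample>_BC1_<number>_<R1|R2>.fastq.gz
PATTERN = re.compile(r"^(?P<sample>.+)_BC1_(?P<bc>\d+)_(?P<read>R[12])\.fastq\.gz$")
OFFSET = 48  # assumption: BC1_n is paired with BC1_(n + 48)


def group_files(names):
    """Map each output filename to its input filenames, ordered by BC1 number."""
    groups = defaultdict(list)
    for name in names:
        m = PATTERN.match(name)
        if m is None:
            continue  # skip anything that does not match the assumed layout
        bc = int(m["bc"])
        if not 1 <= bc <= 2 * OFFSET:
            continue  # skip BC1 numbers outside 1..96
        low = bc if bc <= OFFSET else bc - OFFSET
        out = f"{m['sample']}_BC1_{low}-{low + OFFSET}_{m['read']}.fastq.gz"
        groups[out].append((bc, name))
    return {out: [n for _, n in sorted(members)] for out, members in groups.items()}


def merge_pairs(in_dir, out_dir):
    """Concatenate paired .fastq.gz files from in_dir into out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    names = (p.name for p in in_dir.iterdir())
    for out_name, members in group_files(names).items():
        with open(out_dir / out_name, "wb") as dst:
            for name in members:
                # Raw byte copy: gzip members can simply be appended.
                with open(in_dir / name, "rb") as src:
                    shutil.copyfileobj(src, dst)
```

Calling `merge_pairs("demux", "merged")` (both directory names are placeholders) would write the merged files into a separate folder and leave the originals untouched. It is probably worth trying this on a copy of a few files first and checking the read counts before running it on all 800,000.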
Curious about how the data ended up in this format? Is this some kind of custom single cell data design? If it is one of the standard single-cell platforms then this may have been made more complicated than necessary.
Hi,
Yes, this is a custom single-cell protocol that I had to demultiplex myself. And yes, BC1 is given to all samples and is combined with the numbers I mentioned above, so that it should be...
I edited my original post. Hopefully that helps provide more insight into how I can solve this issue!