In this paper the authors describe a custom iterative barcoding strategy to index single cells for single-cell RNA-Seq without droplets or cell isolation. They have archived their data on GEO: GSE110823. However, I was unable to find any scripts or bioinformatics tools that they used to process the raw data.
I have put together a SPLiT-Seq demultiplexing strategy using bash. With the goal of making it more robust, accurate, and functional, I wanted to share it with the bioinformatics community. Any comments, criticisms, bug reports, or pointers to any official SPLiT-Seq tools are much appreciated.
Can you describe how the function of this differs from the standard bcl2fastq? From the looks of your script, you are starting with .fastq files (bcl2fastq starts from .bcl), so I'm not sure how much divergence in functionality there is after that step.
I am not an expert on bcl2fastq, but I understand that it is used as the first demultiplexing step after Illumina sequencing: it takes the .bcl files generated by the sequencer, then both demultiplexes them and converts them to standard .fastq format for each indexed sample. To do this, bcl2fastq is capable of extracting samples indexed with standard Illumina barcodes from one of their index kits. Because SPLiT-Seq uses a custom three-stage, ligation-based barcoding design in addition to standard Illumina indexes, I assume that bcl2fastq is not capable of demultiplexing the single cells based on this non-Illumina indexing strategy. As an example, the archived SPLiT-Seq datasets on GEO appear to have had the Illumina indexes removed while the SPLiT-Seq barcodes 1-3 are still there.
I am sure there are similarities, as both tools try to demultiplex samples... so I would be very interested to know whether bcl2fastq could be made to work on SPLiT-Seq data. Perhaps other people here have more insight.
Hello, first I would like to thank you for your work, but I have some doubts about the Round*_barcodes_new.txt files. I am curious why the Round 1 barcode is 13 bases, Round 2 is 13 bases, and Round 3 is 14 or 15 bases. We know that the Round 1 unique barcode is 8 bases, yet you add 5 redundant bases that are identical across all barcodes.
The 8 bp barcode sequences used in the SPLiT-Seq protocol are redundant between the different barcoding rounds. You can confirm this here: https://drive.google.com/file/d/1l3MkxYVWOPTV5KPpq2uXMLzKOKwVeZyw/view?usp=sharing
The barcodes can be redundant because they have distinct flanking sequences used for the ligation. As a result, the position of each barcode and its flanking sequence in the resulting read is unique. The several bases of flanking sequence added to the 8 bp barcode are included in the "barcodes_new3.txt" files to allow the script to detect a match unique to the intended barcode position. For example, a Round 2 barcode will not match a barcode added during Round 1 or Round 3 because those lack the Round 2 flanking sequence. If you look at an example read sequence and the barcode sequences linked above, this logic will become clear.
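As a toy illustration of this positional logic (the barcodes, flanking bases, and read offset below are made up; the real values come from the published barcode files), a flank-plus-barcode entry only matches when its round-specific flanking bases sit at the expected position in the read:

```python
# Hypothetical example: the same 8 bp barcode appears in two rounds,
# but each round's entry carries distinct flanking bases, so only the
# intended round matches its expected window in the read.
ROUND2_ENTRY = "GTACG" + "AACCGGTT"   # made-up flank + shared 8 bp barcode
ROUND3_ENTRY = "TTGCA" + "AACCGGTT"   # same barcode, different flank

def matches_round(read, entry, start):
    """True if the flank+barcode entry occurs at the expected read
    position (exact match; no mismatches, for simplicity)."""
    return read[start:start + len(entry)] == entry

# A made-up read containing the Round 2 flank+barcode at offset 10.
read = "ACGTACGTAC" + "GTACGAACCGGTT" + "ACGTACGT"

print(matches_round(read, ROUND2_ENTRY, 10))  # True: correct flank
print(matches_round(read, ROUND3_ENTRY, 10))  # False: wrong flank
```

Even though both entries end in the same 8 bp barcode, only the entry with the correct flanking sequence matches, which is why the redundant barcodes never collide across rounds.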
Thanks a lot for your meticulous reply. I have now noticed that the full dataset used for benchmarking contains only 131 nuclei (according to the paper), while your script generated 14,423 .fastq files (cells) as output. I am new to bioinformatics and hope my questions do not bother you too much.
@rainwashautumn I have also noticed this discrepancy with GSM3017260 (SRR6750041). I emailed the authors of the paper months ago about this point but never received a response. I think the issue may be one of two things: 1. the dataset is mislabeled in the GEO archive, or 2. they performed additional filtering to somehow arrive at 131 nuclei, eliminating a very large number of barcode combinations.
From the testing we have done on "splitseqdemultiplexing.sh", we know that it performs as expected on datasets containing a computationally predefined number of barcode combinations. We also have confidence that each of the barcode combinations identified exists in the .fastq file downloaded from SRR6750041, because "splitseqdemultiplexing.sh" records each combination found and we can confirm its presence in the raw reads.
Finally, we seem to get results much closer to those reported in the SPLiT-Seq paper if we process other files, for example GSM3017260 (SRR6750042). Anyway, you point out a very interesting discrepancy that I have not been able to wrap my head around. If you figure it out, please share your findings with us here.
HUGE UPDATE to SPLiT-Seq_Demultiplexing released on Jan 16, 2019. An enormous increase in speed was achieved by rewriting steps 1 and 2 in Python to make use of hash data structures. Support for a user-defined number of mismatches was added. Suggestions, comments, and criticisms welcome.
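As a rough illustration of the hash-based idea (this is a sketch, not the actual script, and the 8 bp barcodes are made up): each barcode and all of its allowed mismatch variants can be preloaded into a dictionary, so every read fragment is resolved with a single O(1) lookup instead of a grep over the barcode list.

```python
def one_mismatch_variants(barcode, alphabet="ACGT"):
    """All sequences within one mismatch of barcode (incl. barcode itself).
    Higher mismatch counts would extend this recursively."""
    variants = {barcode}
    for i in range(len(barcode)):
        for alt in alphabet:
            variants.add(barcode[:i] + alt + barcode[i + 1:])
    return variants

def build_lookup(barcodes):
    """Map every acceptable sequence back to its true barcode, so each
    read fragment resolves with one dictionary lookup."""
    lookup = {}
    for bc in barcodes:
        for v in one_mismatch_variants(bc):
            lookup[v] = bc
    return lookup

barcodes = ["AACCGGTT", "GGTTAACC"]  # made-up 8 bp barcodes
lookup = build_lookup(barcodes)
print(lookup.get("AACCGGTA"))  # AACCGGTT: one mismatch tolerated
print(lookup.get("AACCGGAA"))  # None: two mismatches, no assignment
```

The memory cost of precomputing variants is modest for 8 bp barcodes (25 entries per barcode at one mismatch), which is what makes trading it for per-read speed worthwhile.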
I have a few questions and some feedback on the code:
Questions:
What is a "round"?
Why are there 3 rounds?
Is that the typical use case, or can there be more (or fewer) "rounds"? (I see that the paper answers these questions.)
Why use cut -b 1- to read the files into an array?
Code feedback:
I'd recommend switching to files passed as command line arguments and processed using getopts, instead of hard-coding filenames in scripts. That way, you can have your input file wherever you like.
Use set -eo pipefail so the script stops if any step fails
Ensure log and out files can be created (i.e. dir has write permissions), or better yet, just write to STDOUT and STDERR as appropriate
Do not implicitly rm. Instead, create directories of the format prefix.XXX using mktemp [-d]. See here
There are a lot of loops and conditionals, maybe restructure the code so you need fewer loops? Better yet, switch to python. Shell loops are bad (Advanced topic)
Unless you're using regular expressions, grep -F (or grep -Fx if you're matching an entire line) will reduce the load on grep.
You're exporting and parallelizing the grep function; why not do that throughout the script instead of just in one place?
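A minimal Python sketch of the first few suggestions above (the flag names and defaults here are hypothetical, not the script's real options): take filenames from the command line, fail fast on unreadable input, and write into a fresh uniquely named directory rather than implicitly removing anything.

```python
import argparse
import tempfile

def parse_args(argv):
    # Command-line arguments instead of hard-coded filenames.
    # Flag names below are made up for illustration.
    p = argparse.ArgumentParser(description="SPLiT-Seq demultiplexing sketch")
    p.add_argument("-f", "--fastq", required=True, help="input .fastq file")
    p.add_argument("-o", "--outdir", help="output dir (temp dir if omitted)")
    p.add_argument("-m", "--mismatches", type=int, default=1)
    return p.parse_args(argv)

# Parse an explicit argument list (normally sys.argv[1:]).
args = parse_args(["-f", "reads.fastq", "-m", "2"])

# tempfile.mkdtemp is the Python analogue of `mktemp -d`: a fresh,
# uniquely named directory, so nothing pre-existing is removed.
outdir = args.outdir or tempfile.mkdtemp(prefix="splitseq.")
print(args.fastq, args.mismatches, outdir)
```

argparse also gives you `--help` output and type checking for free, which getopts in bash does not.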
Ram, thanks very much for your feedback; I appreciate it.
Explanation of "rounds": the SPLiT-Seq paper describes three successive "rounds" of barcoding, and I am borrowing the Round 1, Round 2, Round 3 terminology from their publication. Each round of barcoding in the wet lab is performed through the addition of a specific set of barcode oligos. The typical use case is three rounds of barcoding, which achieves a high number of possible barcode combinations: Round 1 (48) x Round 2 (96) x Round 3 (96) = 442,368 possible combinations.
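The combination count is just the product of the three pool sizes:

```python
# Combination count from the three rounds of barcoding described in
# the SPLiT-Seq paper: 48 x 96 x 96 barcode oligos.
round1, round2, round3 = 48, 96, 96
combos = round1 * round2 * round3
print(combos)  # 442368
```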
Code feedback: Thanks for your detailed comments. I will work on revising the code.
The person on SO has parallelized the innermost loop. I'm guessing you could try applying that logic (abstracting it as a function, using input files as arguments, and parallelizing the function) to the outer loops as well, although I'm not sure how the task distribution would work in such an implementation.
To flesh this out as a fully fledged tool, I would suggest showing some runtime comparisons for different data sizes, etc.
The GitHub link appears to be incorrect.
Thanks, link updated.
Is your script generally applicable/easily extensible to other SPLiT-seq datasets or is it for this specific one you mention above?
It should be directly applicable to any SPLiT-Seq dataset using the published indexes. If new indexes are used a user can simply change the barcodes.txt files to reflect their index sequences.
Thanks! You may want to mention that in original post/on GitHub.