Hello Shawn, there are a number of things to keep in mind:
First, as dsull pointed out, what you call a 129bp long barcode is actually barcode sequences interspersed with linkers and a UMI. The cell barcode in Parse’s SPLiT-seq is 24bp long in total: three 8bp barcodes that appear in Read 2, separated by linkers of different lengths. A schematic of Read 2 is as follows:
UMI (10bp) – Barcode 3 (8bp) – Linker (30 bp) – Barcode 2 (8bp) – Linker (22bp) – Barcode 1 (8bp)
Each barcode sequence (1, 2 and 3) comes from one of the three split-pool steps of the combinatorial indexing. The barcode sequence that identifies the cell of origin for each read can be obtained by concatenating Barcodes 3, 2 and 1, in that order. You may want to make sure that the structure I wrote above matches your specific experiment, as linker length varies between split-seq platforms. However, that should be the structure for Parse’s SPLiT-seq as far as I know.
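To make the layout concrete, here is a minimal Python sketch that slices a Read 2 sequence according to the schematic above and concatenates Barcodes 3, 2 and 1. The names `READ2_LAYOUT`, `segment_offsets` and `split_read2` are made up for this illustration, and the segment lengths are simply the ones from the schematic, so check them against your own kit first.

```python
# Read 2 layout from the schematic above: (segment name, length in bp).
# Linker names are invented labels; only the lengths come from the schematic.
READ2_LAYOUT = [
    ("umi", 10),
    ("bc3", 8),
    ("linker_a", 30),
    ("bc2", 8),
    ("linker_b", 22),
    ("bc1", 8),
]


def segment_offsets(layout):
    """Return {name: (start, stop)} with 0-based, end-exclusive offsets."""
    offsets = {}
    pos = 0
    for name, length in layout:
        offsets[name] = (pos, pos + length)
        pos += length
    return offsets


def split_read2(seq, layout=READ2_LAYOUT):
    """Cut one Read 2 sequence into its segments and build the cell barcode."""
    offsets = segment_offsets(layout)
    needed = max(stop for _, stop in offsets.values())
    if len(seq) < needed:
        raise ValueError(f"read is {len(seq)}bp, layout needs {needed}bp")
    parts = {name: seq[start:stop] for name, (start, stop) in offsets.items()}
    # The cell barcode is Barcode 3 + Barcode 2 + Barcode 1, in that order.
    parts["cell_barcode"] = parts["bc3"] + parts["bc2"] + parts["bc1"]
    return parts


# Synthetic read built only to exercise the function (not real data).
example_read = "A" * 10 + "C" * 8 + "G" * 30 + "T" * 8 + "G" * 22 + "ACGTACGT"
example_parts = split_read2(example_read)
print(example_parts["umi"], example_parts["cell_barcode"])
```

If your reads are longer than 86bp, the extra bases are simply ignored by this sketch.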
Second, as dsull also pointed out, if you know the position of each barcode in the read, you can tell Kallisto to extract those positions and concatenate them. This can be done with the -x syntax dsull describes. However, for your convenience, a colleague and I have implemented Parse’s SPLiT-seq technology in this Kallisto GitHub fork: https://github.com/bound-to-love/kallisto.git
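If you prefer the -x route, the string can be derived from the Read 2 layout rather than typed by hand. The sketch below follows the bc:umi:seq format of file,start,stop triplets from dsull's example. It assumes that the cDNA read is passed to Kallisto first (file 0) and the barcode read second (file 1), and that the barcode pieces are concatenated in the order they are listed; `build_x_string` is a made-up helper name, so please verify the result against your own files.

```python
# Read 2 layout from the schematic in this post: (segment name, length in bp).
READ2_LAYOUT = [
    ("umi", 10), ("bc3", 8), ("linker_a", 30),
    ("bc2", 8), ("linker_b", 22), ("bc1", 8),
]


def build_x_string(layout, barcode_order=("bc3", "bc2", "bc1"), umi="umi",
                   barcode_file=1, cdna_file=0):
    """Build a 'bc:umi:seq' string made of file,start,stop triplets.

    barcode_file and cdna_file are assumptions about the order in which the
    FASTQ files are given on the command line.
    """
    offsets = {}
    pos = 0
    for name, length in layout:
        offsets[name] = (pos, pos + length)
        pos += length

    bc_part = ",".join(
        f"{barcode_file},{offsets[name][0]},{offsets[name][1]}"
        for name in barcode_order
    )
    umi_part = f"{barcode_file},{offsets[umi][0]},{offsets[umi][1]}"
    # start=0, stop=0 for the sequence, mirroring the last field of
    # dsull's example; as I understand it, this means the whole read.
    seq_part = f"{cdna_file},0,0"
    return f"{bc_part}:{umi_part}:{seq_part}"


x_string = build_x_string(READ2_LAYOUT)
print(x_string)
```

With the default arguments this prints 1,10,18,1,48,56,1,78,86:1,0,10:0,0,0, which you would pass after -x.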
Third, the different samples in a SPLiT-seq experiment are encoded by the sequence of Barcode 1, that is, the sequence attached to the transcripts in the first split-pool step. It works as follows: the Barcode 1 sequence of each well of the plate is known and has an ID number associated with it. Therefore, by loading each sample (or experimental condition) into a specific set of wells with known IDs, you can recover which sample a cell comes from by checking its Barcode 1 sequence. In order to demultiplex the samples, you need to know how the experiment was performed: which wells of the first plate were used for which sample, and which Barcodes 1 were present in each well.
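As a rough illustration of that lookup, the sketch below takes a 24bp cell barcode (Barcodes 3, 2 and 1 concatenated), reads Barcode 1 from its last 8 bases, and maps it to a sample. The Barcode 1 sequences, IDs and sample names are invented placeholders; the real tables have to come from your kit documentation and your plate layout.

```python
# Hypothetical Barcode 1 sequence -> ID table (placeholder sequences).
BC1_TO_ID = {
    "AACGTGAT": 1,
    "AAACATCG": 2,
    "ATGCCTAA": 3,
}

# Hypothetical plate layout: which Barcode 1 IDs received which sample.
ID_TO_SAMPLE = {
    1: "control",
    2: "control",
    3: "treated",
}


def sample_for_cell(cell_barcode, bc1_to_id=BC1_TO_ID,
                    id_to_sample=ID_TO_SAMPLE, bc1_len=8):
    """Return the sample name for a cell barcode, or None if unknown."""
    # Barcode 1 is the last piece of the Barcode 3 + 2 + 1 concatenation.
    bc1 = cell_barcode[-bc1_len:]
    bc1_id = bc1_to_id.get(bc1)
    if bc1_id is None:
        return None
    return id_to_sample.get(bc1_id)


print(sample_for_cell("CCCCCCCC" + "TTTTTTTT" + "ATGCCTAA"))
```

Cells whose Barcode 1 is not in the table come back as None, which is a convenient way to spot a mismatch between the table and the experiment.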
Finally, there is another layer of complexity to SPLiT-seq analysis, which applies in some cases; I am not sure whether it applies to yours. In the first round of barcoding described above, there can be two Barcodes 1 per well instead of one. This apparently serves as an internal control: poly(A) transcripts get one barcode, and those primed with random hexamers get the other. I am not sure how that internal control is used, but it is clearly relevant for processing the data. The main implication is that the transcripts of each cell are not uniquely labeled: they can carry either of two Barcode 1 sequences, so a single cell appears as two different cells in the analysis.
For example, we may know that well 1 of the first split-pool step contains two Barcodes 1: Barcode 1-A and Barcode 1-B, with IDs ID-2 and ID-50. We have cells from an experimental condition X, which we pipette into well 1. The transcripts from those cells will be labeled with either barcode ID-2 or barcode ID-50. In the subsequent split-pool steps, each cell of that condition will get a different Barcode 2 and Barcode 3. When we analyze the data, the transcripts of each cell of condition X will carry one specific combination of Barcodes 2 and 3, together with either ID-2 or ID-50 as Barcode 1. Therefore, we need to combine the counts from “cells” that share the same Barcodes 3 and 2 and have Barcode 1 ID-2 or ID-50 into a single cell, even though their barcode sequences differ. It is a small step, but a bit confusing and annoying.
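The merging step can be sketched in plain Python as follows. This is not the code from our notebook, just an illustration of the idea: barcodes are grouped by their Barcode 3 + Barcode 2 prefix and by the well their Barcode 1 belongs to, and the gene counts within each group are summed. The sequences, well numbers and gene names are invented placeholders.

```python
from collections import Counter, defaultdict


def merge_paired_cells(counts_by_barcode, bc1_to_well, bc1_len=8):
    """Sum gene counts of barcodes that differ only by paired Barcodes 1.

    counts_by_barcode: {24bp cell barcode: {gene: count}}
    bc1_to_well: {Barcode 1 sequence: well}; both Barcodes 1 of a well
    map to the same well. Raises KeyError for an unknown Barcode 1.
    """
    merged = defaultdict(Counter)
    for barcode, gene_counts in counts_by_barcode.items():
        prefix, bc1 = barcode[:-bc1_len], barcode[-bc1_len:]
        well = bc1_to_well[bc1]
        # Counter.update adds counts instead of replacing them.
        merged[(prefix, well)].update(gene_counts)
    return dict(merged)


# Placeholder data: well 1 holds two Barcodes 1, well 2 holds a third one.
bc1_to_well = {"AAAAAAAA": 1, "GGGGGGGG": 1, "ACACACAC": 2}
prefix = "CCCCCCCC" + "TTTTTTTT"  # Barcode 3 + Barcode 2
counts = {
    prefix + "AAAAAAAA": {"geneA": 2, "geneB": 1},
    prefix + "GGGGGGGG": {"geneA": 1},
    prefix + "ACACACAC": {"geneC": 4},
}

merged = merge_paired_cells(counts, bc1_to_well)
for key, gene_counts in merged.items():
    print(key, dict(gene_counts))
```

Here the first two “cells” collapse into one entry for well 1, while the third stays separate because its Barcode 1 belongs to a different well.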
My colleague and I have written a Google Colab notebook that you can use as a template for your analysis: https://colab.research.google.com/drive/1eQ_2pZOaCk_-5n0LuNP4iLoS4oY8wM6I?usp=sharing
It does the following:
1) Installs and imports relevant packages, including the Kallisto version with SPLiT-seq technology, bustools, and kb-python.
2) Downloads your reads and your whitelist (you need to complete this part)
3) Downloads a pre-built index for Kallisto
4) Runs Kallisto-Bustools on your reads. You can use your whitelist (provided at step 2) to correct the reads or generate a whitelist from your reads. Both options are included.
5) Stores the kallisto-bustools output in an anndata object.
6) Assigns an ID to each cell according to its Barcode 1. I have included a Barcode 1-to-ID map in the code, but you need to make sure your experiment used the same one.
7) Combines “cells” with matching Barcodes 1 as explained above.
The output of all this code is an anndata object with cells annotated by ID, which you can use to identify the original sample each cell belongs to.
I hope this is useful. Please let us know if you have any other questions!
I've never worked with splitseq before but it seems like the "barcodes" are actually barcode sequences interspersed with linker sequences (e.g. barcode-linker-barcode-linker-barcode). I'd use -x to extract the barcode sequences (not the linker sequences).
For example, using bc:umi:seq, you can do -x 0,0,5,0,12,17:1,0,20:2,0,0. In that example, for your barcode, you get the first 5 bases of file #0 concatenated with 5 bases (starting at position 12) of that same file.