Hello,
I have a single cell pipeline using kallisto bus for the pseudoalignment and then bustools to correct barcodes and sort them. I have an issue of huge decrease of BUS records while running bustools correct
on some libraries. For examples for library SRX12187867 I get that output from bustools correct
and bustools sort
.
Found 6794880 barcodes in the whitelist
Processed 139893071 BUS records
In whitelist = 197236
Corrected = 1746603
Uncorrected = 137949232
Sort bus file
Read in 1943839 BUS records
the output.bus
file generated by kallisto bus
for this library is 4.4G (for ~140M BUS records). The bustools corrected file generated by bustools correct
is only 60M (for ~1.9M BUS records).
I also processed other libraries from the same experiment and did not have the same issue.
Do you have any ideas what could be the reason for such a huge decrease of BUS records? I double checked the whitelist I used and I properly use the 10X V3 whitelist provided by cellranger.
Following @dsull comment the run_info.json
file generated by kallisto is
{
"n_targets": 0,
"n_bootstraps": 0,
"n_processed": 619876553,
"n_pseudoaligned": 110019873,
"n_unique": 82007271,
"p_pseudoaligned": 17.7,
"p_unique": 13.2,
"kallisto_version": "0.46.0",
"index_version": 0,
"start_time": "Mon Jan 15 09:26:58 2024",
"call": "kallisto bus -i /data/index/Homo_sapiens.GRCh38.transcriptome.idx -o /data/kallisto_bus_results/SRX12187867 -x 10xV3 -t 4 /data/FASTQ/9606/SRX12187867/SRR15896987/FASTQ/SRR15896987_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896987/FASTQ/SRR15896987_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896989/FASTQ/SRR15896989_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896989/FASTQ/SRR15896989_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896991/FASTQ/SRR15896991_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896991/FASTQ/SRR15896991_R2.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896993/FASTQ/SRR15896993_R1.fastq.gz /data/FASTQ/9606/SRX12187867/SRR15896993/FASTQ/SRR15896993_R2.fastq.gz"
}
And a subset of the R1 and R2 files for one sample :
[jwollbre]$ zcat SRR15896991/FASTQ/SRR15896991_R1.fastq.gz | head -4
@SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
GCTGTTTTAGGGGATTGTAGGTACTATATGGAAATGAAGATTTTAAAGTAGGGTAAGATCAGACCCAGCTGCTCCACTCCTGAGTGTTTACTCACCTAGGAAATACACATCCATACAAAGACTTGTACATACGTGGTTTTAGCAGCTTATT
+SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
[jwollbre]$ zcat SRR15896991/FASTQ/SRR15896991_R2.fastq.gz | head -4
@SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
CNCATGACACGCGCATATATTCCTTGATTAATAAAAGCATCTACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR15896991.1 A01010:52:H3MHTDSXY:3:1101:1994:1000 length=151
???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Hello! What you're seeing means that the barcodes extracted from your FASTQ files don't appear in the supplied "whitelist".
This can happen if the "kallisto bus" command being run is incorrect (e.g. perhaps you switched the R1 and R2 files? perhaps you supplied the index file instead of the R1 file? etc.). Can you show me the command you are running and the run_info.json file produced by "kallisto bus"?
Thanks for your comment. I updated my question to add the run_info.json file and an example of the R1 and R2 files.
R2/R1 files are indeed switched. Giveaway is the poly-A tail that you see in your Read 2. That is represented in the library structure: https://kb.10xgenomics.com/hc/en-us/articles/360000939852-What-is-the-difference-between-Single-Cell-3-and-5-Gene-Expression-libraries
Read 1 is normally UMI + Cell barcodes and then you will see a stretch of poly-A's, if read 1 is sequenced longer than 26 or 28 bp. Read 2 immediately reads into RNA insert.
That is exactly what I was scared of. Do you have any advice on how to check for file names to detect those cases where R1 and R2 are switched and both have the same read length? I would like to add a check to my pipeline and be 100% sure the name of the file is correct.
Not sure where and how you got the data from but using
I get three files.
_1
= Illumina index_2
= cell barcode + UMI_3
= RNA read.Thanks GenoMax I did the same and arrived exactly at the same conclusion than you.
I had issues with some SRA libraries download with
fastq-dump
. R1 and R2 files were not properly named _1 and _2. (or _2 and _3 if a I1 file exists). I used the wrong and naïve assumption that read length could help me solve this issue by renaming all downloaded files based on read length (I1 < R1 < R2).I am now going to use fasterqParseR approach to check files downloaded from SRA.
I found the R package fasterqParseR. This package is not only using read length but also the "whitelist" to correct for file name. It is quite elegant even if it is probably not 100% bullet proof. If you have any other idea or know any other tools dealing with that issue please let me know and thanks for your answers.
I don't think you can check for the switch based on the file name. It is the reads inside that gives thing away.