Hi everyone,
I am trying to understand the entries of the inspect.json file that is output by kb-python, the wrapper tool from the Pachter lab. I believe that this file is produced by bustools. I have analysed the standard PBMC_1K_V3 scRNA-seq dataset from 10x Chromium (pseudoalignment and quantification with kb-python).
Here is the inspect.json file obtained from running kb-python on the PBMC_1K_V3 dataset:
{
"numRecords": 18768851,
"numReads": 39738410,
"numBarcodes": 518890,
"medianReadsPerBarcode": 3.000000,
"meanReadsPerBarcode": 76.583496,
"numUMIs": 8050936,
"numBarcodeUMIs": 13514564,
"medianUMIsPerBarcode": 1.000000,
"meanUMIsPerBarcode": 26.045143,
"gtRecords": 7713525,
"numBarcodesOnWhitelist": 239530,
"percentageBarcodesOnWhitelist": 46.162000,
"numReadsOnWhitelist": 38105469,
"percentageReadsOnWhitelist": 95.890774
}
My questions are:
- What do the various keys mean? For example, what do numRecords and numBarcodes mean? I suppose numRecords is the number of entries in the BUS file, but why is numBarcodes so high? The fastq files analysed correspond to approximately 1000 cells, so why does the number of barcodes (I assume cell barcodes) exceed the number of cells by this much? I am assuming that the number of corrected barcodes equals the number of cells.
- How important is this file for checking the performance of kb-python? I can understand the output of the kb_info.json and run_info.json files as they are self-explanatory, but the contents of inspect.json baffle me and I have yet to find any documentation of them online.
For your information, I am also sharing the Cell Ranger output (cellRanger_summary) and a screenshot of run_info.json.
Thank you for reading and I appreciate your time.
Correct, which is why you should load your output into R or python and filter your cells (i.e. do the "knee plot"). See the kallisto | bustools tutorials for more details.
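To make the "knee plot" idea concrete, here is a minimal sketch in plain Python. It is not the code from the kallisto | bustools tutorials. The function names, the toy UMI counts, and the knee heuristic (the point farthest above the chord joining the ends of the curve in log-log space) are all assumptions chosen for illustration. With real data, the per-barcode UMI totals would come from the count matrix that kb-python writes.

```python
import math


def knee_curve(umis_per_barcode):
    """Return per-barcode UMI counts sorted in descending order.

    Plotted against rank (1, 2, 3, ...) on log-log axes, this is the knee plot.
    """
    return sorted((c for c in umis_per_barcode.values() if c > 0), reverse=True)


def knee_threshold(sorted_counts):
    """Heuristic knee (an assumption, not the tutorials' method): the point
    farthest above the chord joining the first and last points of the curve
    in log-log space. Returns the UMI count at that point."""
    n = len(sorted_counts)
    if n < 3:
        return sorted_counts[-1] if sorted_counts else 0
    xs = [math.log10(rank) for rank in range(1, n + 1)]
    ys = [math.log10(count) for count in sorted_counts]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    # The cross product is proportional to the signed distance from the chord.
    heights = [dx * (y - ys[0]) - dy * (x - xs[0]) for x, y in zip(xs, ys)]
    knee_index = max(range(n), key=heights.__getitem__)
    return sorted_counts[knee_index]


def filter_barcodes(umis_per_barcode, min_umis):
    """Keep only barcodes with at least min_umis UMIs."""
    return {bc: c for bc, c in umis_per_barcode.items() if c >= min_umis}


# Hypothetical toy data: 5 "real cells" and 15 near-empty background barcodes.
toy_counts = {f"CELL{i}": c for i, c in enumerate([1200, 1100, 1000, 950, 900])}
toy_counts.update({f"BG{i}": 1 + i % 3 for i in range(15)})

threshold = knee_threshold(knee_curve(toy_counts))
kept = filter_barcodes(toy_counts, threshold)
print(threshold, len(kept))
```

The unfiltered output contains every barcode that received at least one read, so most entries look like the BG barcodes here. Only the barcodes above the knee are treated as cells.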
You don't really need to use that file to check performance/QC (especially since your output is unfiltered) -- you're better off loading your output into R or python, filtering your cells, and doing QC from there.
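That said, if you just want to sanity-check how the fields in inspect.json relate to each other, the derived values can be reproduced from the raw counts. The sketch below uses only the numbers posted in the question. The relationships it checks are inferred from those numbers, not taken from bustools documentation, so treat them as an assumption.

```python
import json

# Values copied from the inspect.json posted in the question.
inspect_text = """
{
  "numRecords": 18768851,
  "numReads": 39738410,
  "numBarcodes": 518890,
  "meanReadsPerBarcode": 76.583496,
  "numBarcodeUMIs": 13514564,
  "meanUMIsPerBarcode": 26.045143,
  "numBarcodesOnWhitelist": 239530,
  "percentageBarcodesOnWhitelist": 46.162000,
  "numReadsOnWhitelist": 38105469,
  "percentageReadsOnWhitelist": 95.890774
}
"""
stats = json.loads(inspect_text)

# Inferred relationships (assumptions checked against the posted numbers).
mean_reads = stats["numReads"] / stats["numBarcodes"]
mean_umis = stats["numBarcodeUMIs"] / stats["numBarcodes"]
pct_barcodes = 100 * stats["numBarcodesOnWhitelist"] / stats["numBarcodes"]
pct_reads = 100 * stats["numReadsOnWhitelist"] / stats["numReads"]

for name, value, reported in [
    ("meanReadsPerBarcode", mean_reads, stats["meanReadsPerBarcode"]),
    ("meanUMIsPerBarcode", mean_umis, stats["meanUMIsPerBarcode"]),
    ("percentageBarcodesOnWhitelist", pct_barcodes, stats["percentageBarcodesOnWhitelist"]),
    ("percentageReadsOnWhitelist", pct_reads, stats["percentageReadsOnWhitelist"]),
]:
    print(f"{name}: computed {value:.4f}, reported {reported:.4f}")
```

In this file the per-barcode means are taken over all 518890 observed barcodes, most of which are not cells. That is why the means (about 77 reads and 26 UMIs) and the medians (3 reads and 1 UMI) say little about the roughly 1000 real cells.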