I was able to process a Smart-Seq2 dataset with the following code:
STAR --runThreadN $CPUS \
  --genomeDir /path/to/dir/ \
  --readFilesCommand gunzip -c \
  --outFileNamePrefix /path/to/dir/ \
  --soloType SmartSeq \
  --readFilesManifest /path/to/tsv \
  --soloUMIdedup Exact NoDedup \
  --outSAMtype BAM SortedByCoordinate \
  --soloStrand Unstranded \
  --outBAMsortingBinsN 200
I need to work with the BAM file "Aligned.sortedByCoord.out.bam" for some downstream analyses. Given that the "CB" tag isn't applicable to Smart-Seq2 data, it's not clear how I can determine which read is associated with which cell. For example, here is one alignment from the BAM file:
LH00244:248:22VJNCLT3:4:1141:9968:28382 99 chr1 3000720 255 118M33S = 3000866 297 CTTTATTTCATCATTGACCAAGCTATCATTAAGTAGAGTATTGTTCCGTTTCCAAGTGAACGTTTGCTTTCTCTTATTTCTGCTGTTCTTTAAGATCAGCCTTCGTCCGTAGTTCTCTTAAAAGATGCACGGGAAAACTTCCATATTTTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9II9III9I**I999I99II9II99999**9999**9999999**9999999999999999999999*99999* NH:i:1 HI:i:1 AS:i:247 nM:i:10
Any suggestions on how I can add this information or parse the file into separate BAM files per cell?
Based on your use of UMIs, you appear to have Smart-seq3 data.
Since this is single-cell data, I don't know if plain STAR is appropriate to use here. You may need to use STARsolo instead. You could also use kallisto (Analysis of Smart-Seq3 data with kallisto-bustools) or alevin-fry (https://combine-lab.github.io/alevin-fry-tutorials/2021/sci-rna-seq3/).
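On the original question of which read belongs to which cell: the third column of the file passed to --readFilesManifest is the read group for that cell's FASTQ files, so the read group (the RG tag) is the natural place to look for a per-cell label. The alignment shown in the question carries only NH, HI, AS and nM, so whether your BAM actually has RG:Z: on each record is an assumption to verify first. Here is a minimal sketch that tallies alignments per tag value from SAM text. The function name count_reads_per_tag and the NO_TAG label are made up for illustration.

```shell
#!/usr/bin/env bash
# Count alignments per value of a string-typed SAM tag (default: RG).
# Reads SAM text on stdin; header lines are skipped.
# Records that lack the tag are counted under the placeholder "NO_TAG".
# Assumption: the per-cell label, if present, is stored as <TAG>:Z:<value>.
count_reads_per_tag() {
    local tag="${1:-RG}"
    awk -F'\t' -v prefix="${tag}:Z:" '
        /^@/ { next }                       # skip @HD/@SQ/@RG/@PG header lines
        {
            value = "NO_TAG"
            # optional fields start at column 12 in a SAM record
            for (i = 12; i <= NF; i++) {
                if (index($i, prefix) == 1) {
                    value = substr($i, length(prefix) + 1)
                    break
                }
            }
            counts[value]++
        }
        END { for (v in counts) printf "%s\t%d\n", v, counts[v] }
    ' | LC_ALL=C sort
}
```

You would run it as `samtools view Aligned.sortedByCoord.out.bam | count_reads_per_tag RG`, and `samtools view -H Aligned.sortedByCoord.out.bam | grep '^@RG'` shows which read groups the header declares. If everything lands in NO_TAG, the tag is not being written and this approach does not apply as-is. If the RG tags are there, `samtools split` can write one BAM per read group, which should give you the per-cell files you asked about.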