Cell Ranger is the most widely used software for quantifying single-cell gene expression from 10x libraries. However, most of my data do not follow the standard file naming (like _R1_L001.fastq and _R2_L001.fastq), and both reads have the same length (which is very common in many studies). I therefore extracted the cell barcodes and UMIs using umi_tools whitelist and umi_tools extract. The R2 FASTQ was then aligned to the reference genome.
According to the UMI-tools tutorial, I need to use featureCounts to assign reads to genes, and then count UMIs per cell using umi_tools count.
But this approach has two drawbacks. First, umi_tools does not support multithreading, which makes it VERY time-consuming. Second, it is a storage killer. It requires featureCounts to write a new BAM file with an additional tag. Then I have to sort that BAM, which can also be time-consuming if there are many BAM files. Finally, I pass the sorted BAM to umi_tools count, which generates the count matrix. The whole pipeline triples the disk space occupied by BAM files (original, tagged, and sorted). It is a disaster for me and I really need to save some space.
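For reference, the steps described above can be sketched roughly as follows. This is only an illustration: the flags are the ones used in the UMI-tools single-cell tutorial as best I recall, the helper name and all file names are hypothetical, and the location of the tagged BAM assumes featureCounts writes it next to its -o output.

```python
import shlex
from pathlib import Path


def umi_tools_count_commands(bam, gtf, prefix, threads=4):
    """Build (but do not run) the featureCounts -> sort -> count commands.

    All paths are hypothetical placeholders supplied by the caller.
    """
    # Assumption: with "-R BAM", featureCounts writes
    # <input name>.featureCounts.bam into the directory of its -o output.
    assigned = str(Path(prefix).parent / (Path(bam).name + ".featureCounts.bam"))
    sorted_bam = f"{prefix}.assigned_sorted.bam"
    return [
        # BAM copy 2: reads tagged with their assigned gene (XT tag)
        ["featureCounts", "-a", gtf, "-o", f"{prefix}.gene_assigned",
         "-R", "BAM", bam, "-T", str(threads)],
        # BAM copy 3: coordinate-sorted version of the tagged BAM
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, assigned],
        ["samtools", "index", sorted_bam],
        # Single-threaded step that produces the count matrix
        ["umi_tools", "count", "--per-gene", "--gene-tag=XT",
         "--assigned-status-tag=XS", "--per-cell",
         "-I", sorted_bam, "-S", f"{prefix}.counts.tsv.gz"],
    ]


if __name__ == "__main__":
    for cmd in umi_tools_count_commands("sample.bam", "genes.gtf", "sample"):
        print(shlex.join(cmd))
```

Counting the BAM files that appear (input, tagged, sorted) shows where the threefold storage comes from; deleting the tagged BAM right after sorting would reduce it, at the cost of having to redo that step if needed.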
May I ask if there is any other method to quantify gene expression faster and more conveniently in my case? I would be grateful for any hint. Thanks.
Don’t reinvent the wheel; otherwise you would need to build an entire custom pipeline. Either rename the FASTQ files (e.g. symlink them first, then rename the links) or use another pipeline such as salmon alevin.
Thanks for your kind advice. I was wondering whether I could reuse the aligned BAM. It appears that the better option is to rename the original FASTQ files and run cellranger.
Probably yes. Be sure to do the renaming via a script so that it is reproducible, and track which file gets which name in some kind of log file.
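A minimal sketch of such a script, assuming the target pattern is Sample_S1_L001_R1_001.fastq.gz (the naming Cell Ranger documents for its --fastqs input, as far as I know). The function name, the example sample names, and the log layout are hypothetical; the originals are never touched, only symlinked.

```python
import csv
import os
from pathlib import Path


def link_for_cellranger(mapping, out_dir, log_path):
    """Symlink FASTQs under Cell Ranger-style names and log each mapping.

    mapping: iterable of (source_path, sample, lane, read) tuples, where
    lane is an integer and read is "R1" or "R2".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for src, sample, lane, read in mapping:
        src = Path(src).resolve()
        # Keep the original compression suffix; nothing is recompressed.
        suffix = ".fastq.gz" if src.name.endswith(".gz") else ".fastq"
        dest = out_dir / f"{sample}_S1_L{lane:03d}_{read}_001{suffix}"
        # Replace a stale link from an earlier run, but never a real file.
        if dest.is_symlink():
            dest.unlink()
        os.symlink(src, dest)
        rows.append((str(src), str(dest)))
    # Tab-separated log: which original file was linked under which name.
    with open(log_path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["original", "linked_as"])
        writer.writerows(rows)
    return rows
```

It could be called as link_for_cellranger([("SRR000_1.fastq.gz", "sampleA", 1, "R1"), ("SRR000_2.fastq.gz", "sampleA", 1, "R2")], "fastq_links", "rename_log.tsv"), with the resulting directory passed to cellranger. Keeping the mapping in a version-controlled file and the log next to the links makes it easy to trace every renamed file back to its source.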