Hi Everyone,
I am looking for a solution to speed up UMI extraction from single-cell RNA-Seq data.
I am working with scRNA-Seq data that contains UMIs. The data set contains more than 100,000 single-cell samples.
To recover the UMIs, I am using the following for loop in a bash script, which I run on a Linux cluster. It uses the "extract" command for paired-end reads from the UMI-tools package. The command works as expected, but it is very slow and will take an unacceptable amount of time to run on the full ~15 GB .fastq file.
for cell in "${cells[@]}"
do
    # Quote the paths so cell names containing spaces or glob characters are handled safely
    umi_tools extract -I "results/$cell" \
        --read2-in="results/$cell.MATEPAIR" \
        --bc-pattern=NNNNNNNNNN \
        --log=processed.log \
        --stdout="results_UMI/$cell.read1.fastq" \
        --read2-out="results_UMI/$cell.read2.fastq"
done
I am curious to know if anyone has any suggestions for a faster way to extract UMIs.
You mentioned you are running on a Linux cluster. What job management system are you using? How many fastq files are there (elements in ${cells[@]})? If there is more than one, can you simply unroll your for loop, which runs each $cell serially, and instead execute a set of simultaneous jobs that run in parallel, one per $cell? The potential speedup is proportional to the number of files, up to whatever your cluster capacity allows.
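As a rough illustration of the "one job per $cell" idea, here is a minimal sketch that runs several extractions at once on a single node using plain bash background jobs. It is not tied to any particular job scheduler. The names MAX_JOBS, UMI_TOOLS_BIN, extract_one and run_parallel are hypothetical, and the per-cell log path is an assumption made so that concurrent jobs do not all write to one processed.log. It also assumes bash 4.3 or newer for `wait -n`.

```shell
#!/usr/bin/env bash
# Sketch only: run one `umi_tools extract` per cell, several at a time.
# Assumptions (not from the original post): MAX_JOBS, UMI_TOOLS_BIN and the
# per-cell log file name are illustrative choices.

MAX_JOBS="${MAX_JOBS:-4}"                    # how many extractions may run at once
UMI_TOOLS_BIN="${UMI_TOOLS_BIN:-umi_tools}"  # overridable, e.g. `echo` for a dry run

# Extract UMIs for a single cell, using the same options as the serial loop
# except for the log, which is written per cell (assumption).
extract_one() {
    local cell="$1"
    "$UMI_TOOLS_BIN" extract -I "results/$cell" \
        --read2-in="results/$cell.MATEPAIR" \
        --bc-pattern=NNNNNNNNNN \
        --log="results_UMI/$cell.log" \
        --stdout="results_UMI/$cell.read1.fastq" \
        --read2-out="results_UMI/$cell.read2.fastq"
}

# Launch extract_one for every argument, keeping at most MAX_JOBS running.
run_parallel() {
    local cell
    for cell in "$@"; do
        # Throttle: while MAX_JOBS background jobs are running, wait for one to end.
        while (( $(jobs -rp | wc -l) >= MAX_JOBS )); do
            wait -n
        done
        extract_one "$cell" &
    done
    wait  # block until the remaining background jobs have finished
}
```

With the array from the question, this would be called as `run_parallel "${cells[@]}"`. Setting `UMI_TOOLS_BIN=echo` first prints the commands instead of running them, which is a cheap way to check the paths. On a cluster with a job management system, the same per-cell function could be submitted as one job per cell instead, but the details depend on the scheduler in use.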