Faster UMI extraction from scRNA-Seq data
6.2 years ago
paulranum11 ▴ 80

Hi Everyone,

I am looking for a solution to speed up UMI extraction from single-cell RNA-Seq data.

I am working with scRNA-Seq data that contains UMIs. The data I have contains > 100,000 single-cell samples.

To recover the UMIs, I am using the following for loop in a bash script that I run on a Linux cluster. It uses the "extract" command for paired-end reads from the UMI_Tools package. The command works as expected, but it is very slow and will take an unacceptable amount of time on the full ~15 GB .fastq file.

for cell in "${cells[@]}"
do
    # Quote the paths so cell names with spaces or glob characters do not break the command
    umi_tools extract -I "results/$cell" \
    --read2-in="results/$cell.MATEPAIR" \
    --bc-pattern=NNNNNNNNNN \
    --log=processed.log \
    --stdout="results_UMI/$cell.read1.fastq" \
    --read2-out="results_UMI/$cell.read2.fastq"
done

I am curious to know if anyone has any suggestions for a faster way to extract UMIs.

RNA-Seq scRNA-Seq UMI umi_tools • 3.0k views

You mentioned you are running on a Linux cluster. What job management system are you using? How many fastq files (elements in ${cells[@]}) are there? If more than one, can you simply unroll your for loop, which runs each $cell serially, and execute a set of simultaneous jobs that run in parallel, one per $cell? The potential speedup is proportional to the number of files, depending on your cluster capacity.
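With > 100,000 cells, one job per cell may be more than a scheduler handles comfortably, so here is a rough sketch of the same idea with cells grouped into batches. The function names, the batch file naming and the per-cell log path are all made up for illustration, and the submission command is left out because the job management system is not known.

```shell
# Hypothetical helper: deal cell file names (one per line on stdin)
# round-robin into N batch files named PREFIX.000, PREFIX.001, ...
split_into_batches() {
    awk -v n="$1" -v p="$2" '{ f = sprintf("%s.%03d", p, (NR - 1) % n); print > f }'
}

# Hypothetical per-job body: run the original extract command for every
# cell listed in one batch file. The per-cell --log path is an assumption,
# chosen so that simultaneous jobs do not share one log file.
run_batch() {
    while IFS= read -r cell; do
        umi_tools extract -I "results/$cell" \
            --read2-in="results/$cell.MATEPAIR" \
            --bc-pattern=NNNNNNNNNN \
            --log="results_UMI/$cell.log" \
            --stdout="results_UMI/$cell.read1.fastq" \
            --read2-out="results_UMI/$cell.read2.fastq"
    done < "$1"
}

# Example (in bash, with the cells array from the question):
#   mkdir -p batches
#   printf '%s\n' "${cells[@]}" | split_into_batches 50 batches/cells
# Then submit one job per batch file, each calling: run_batch batches/cells.NNN
```

Each batch still runs its cells serially, so the speedup should be bounded by the number of batches the cluster can run at once.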

6.2 years ago

You could use GNU parallel:

parallel -j 6 'umi_tools extract -I results/{} --read2-in=results/{}.MATEPAIR --bc-pattern=NNNNNNNNNN --log=processed.log --stdout=results_UMI/{}.read1.fastq --read2-out=results_UMI/{}.read2.fastq' ::: path/to/your/files/*.fileending

Replace 6 above with the number of CPUs you want to use on your server. I'm assuming here you have a nice big one with dozens of cores.

Replace path/to/your/files/*.fileending above with the location of the elements in ${cells[@]}. If they are split across several folders, just symlink them all into a single folder (then you can simply use * with no file ending).

Note that {} expands to the full input path. Replace {} with {/} if you just want the file basename, or with {.} to drop the file extension ({/.} does both). Also read the parallel manual.

PS: Watch out for the potential of using too much I/O (depends on the umi_tools design).

PPS: Based on your command, every job writes to the same processed.log, so the logs could get weird.
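One way to sidestep both the path and the log issue is to wrap the command in a small function that takes one input path, strips the directory, and gives each cell its own log. This is only a sketch: the function name, the DRY_RUN switch and the log location are assumptions, not anything umi_tools or parallel provides.

```shell
# Hypothetical wrapper: build the umi_tools command for one cell file.
# With DRY_RUN=1 the command is only printed, which is handy for checking
# the paths before launching thousands of jobs.
extract_cell() {
    cell=$(basename "$1")   # same effect as {/} in GNU parallel
    set -- umi_tools extract -I "results/$cell" \
        --read2-in="results/$cell.MATEPAIR" \
        --bc-pattern=NNNNNNNNNN \
        --log="results_UMI/$cell.log" \
        --stdout="results_UMI/$cell.read1.fastq" \
        --read2-out="results_UMI/$cell.read2.fastq"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage from bash with GNU parallel (functions must be exported first):
#   export -f extract_cell
#   parallel -j 6 extract_cell ::: path/to/your/files/*.fileending
```

Running it once with DRY_RUN=1 on a single file shows exactly which input, output and log paths a job would touch.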


Thanks for this solution; it did result in a speed increase. One note for future readers: the speedup did not scale linearly with the number of cores. Using a test dataset of 50,000 reads from 1737 cells, the parallelized UMI extraction took 20 minutes on 6 cores and 18 minutes on 12 cores. This makes me think that part of the slowdown comes from the overhead of starting umi_tools separately for each individual cell .fastq file.
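If per-file startup really is the bottleneck, one thing to try is a lighter-weight tool for this particular pattern. The sketch below is plain awk, not umi_tools. It assumes uncompressed four-line FASTQ records, the same read order in both files, and a UMI in the first 10 bases of read 1 (the NNNNNNNNNN pattern). It appends the UMI to the read ID with an underscore, which is how I understand umi_tools extract names reads, so compare against real umi_tools output on a small sample before relying on it. The function name is hypothetical.

```shell
# Hypothetical helper, not part of umi_tools.
# Usage: extract_umi_pair R1.fastq R2.fastq OUT1.fastq OUT2.fastq [UMI_LEN]
extract_umi_pair() {
    awk -v r2="$2" -v o1="$3" -v o2="$4" -v n="${5:-10}" '
    # Insert "_UMI" after the read ID, i.e. before the first space if any.
    function tag(name, umi,    sp) {
        sp = index(name, " ")
        if (sp == 0) return name "_" umi
        return substr(name, 1, sp - 1) "_" umi substr(name, sp)
    }
    {
        # The main loop only ever sees header lines of read 1; the other
        # three lines of each record are consumed with getline.
        name1 = $0; getline seq1; getline plus1; getline qual1
        # Read the matching record from the read 2 file.
        getline name2 < r2; getline seq2 < r2; getline plus2 < r2; getline qual2 < r2
        umi = substr(seq1, 1, n)
        # Read 1: UMI bases and their qualities are trimmed off.
        printf "%s\n%s\n%s\n%s\n", tag(name1, umi), substr(seq1, n + 1), plus1, substr(qual1, n + 1) > o1
        # Read 2: sequence is untouched, only the name gets the UMI.
        printf "%s\n%s\n%s\n%s\n", tag(name2, umi), seq2, plus2, qual2 > o2
    }' "$1"
}
```

It can be dropped into the same for loop or parallel call in place of the umi_tools command. It does no validation or logging, so it is a way to test the startup-overhead hypothesis rather than a replacement for umi_tools.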
