Question

Faster UMI extraction from scRNA-Seq data

6.8 years ago

paulranum11 ▴ 80

Hi Everyone,

I am looking for a solution to speed up UMI extraction from single-cell RNA-Seq data.

I am working with scRNA-Seq data that contains UMIs. The data I have contains > 100,000 single-cell samples.

To recover the UMIs I am using the following for loop in a bash script, which I am running on a Linux cluster. It uses the "extract" command for paired-end reads from the UMI-tools package. The command works as expected, but it is very slow and would take an unacceptable amount of time to run on the full ~15 GB .fastq file.

for cell in "${cells[@]}"
do
    # Quote the paths so file names containing spaces or glob characters still work
    umi_tools extract -I "results/$cell" \
    --read2-in="results/$cell.MATEPAIR" \
    --bc-pattern=NNNNNNNNNN \
    --log=processed.log \
    --stdout="results_UMI/$cell.read1.fastq" \
    --read2-out="results_UMI/$cell.read2.fastq"
done

I am curious to know if anyone has any suggestions for a faster way to extract UMIs.

RNA-Seq scRNA-Seq UMI umi_tools • 3.6k views

ADD COMMENT • link updated 6.8 years ago by benformatics 4.1k • written 6.8 years ago by paulranum11 ▴ 80

You mentioned you are running on a Linux cluster. What job management system are you using? How many fastq files (elements in ${cells[@]}) are there? If more than one, can you simply unroll your for loop, which runs each $cell serially, and execute a set of simultaneous jobs that run in parallel, one per $cell? The potential speedup is proportional to the number of files, depending on your cluster capacity.
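As a rough illustration of unrolling the loop without any job scheduler, here is a minimal bash sketch that keeps a fixed number of extractions running in the background. The file names in `cells`, the `MAX_JOBS` and `DRY_RUN` variables, the `extract_one` helper, and the per-file `logs/` directory are all assumptions made for the example, not part of the original script. With `DRY_RUN=1` (the default here) it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: run up to MAX_JOBS extractions at once instead of one after another.
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute it.
MAX_JOBS=${MAX_JOBS:-4}
DRY_RUN=${DRY_RUN:-1}
cells=(cellA.fastq cellB.fastq cellC.fastq)   # hypothetical file names

extract_one() {
    local cell=$1
    # Same options as the original loop, except for a per-file log (an assumption).
    local cmd=(umi_tools extract -I "results/$cell"
        --read2-in="results/$cell.MATEPAIR"
        --bc-pattern=NNNNNNNNNN
        --log="logs/$cell.log"
        --stdout="results_UMI/$cell.read1.fastq"
        --read2-out="results_UMI/$cell.read2.fastq")
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "${cmd[*]}"
    else
        mkdir -p logs results_UMI
        "${cmd[@]}"
    fi
}

for cell in "${cells[@]}"; do
    # Wait here while MAX_JOBS background extractions are still running.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 0.1
    done
    extract_one "$cell" &
done
wait   # block until the last background jobs have finished
```

On a cluster with a scheduler, the body of `extract_one` is what you would submit as one job per file; the throttling loop then becomes unnecessary because the scheduler does the queuing.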

ADD REPLY • link 6.8 years ago by Ahill ★ 2.0k

score 3 · Accepted Answer · 2018-09-17

You could use GNU parallel:

parallel -j 6 'umi_tools extract -I results/{} --read2-in=results/{}.MATEPAIR --bc-pattern=NNNNNNNNNN --log=processed.log --stdout=results_UMI/{}.read1.fastq --read2-out=results_UMI/{}.read2.fastq' ::: path/to/your/files/*.fileending

Replace 6 above with the number of CPUs you want to use on your server. I'm assuming here you have a nice big one with dozens of cores.

Replace path/to/your/files/*.fileending above with the location of your elements in ${cells[@]}. If they are split across several folders, just symlink them all into a single folder (you could then use * with no file ending).

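Note that {} expands to the full input path. Replace {} with {/} if you just want the file basename, with {.} to drop the file extension, or with {/.} for the basename without its extension. Also read the parallel manual.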
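To make the replacement strings concrete, the following sketch prints the plain bash parameter expansions that correspond to GNU parallel's {}, {.}, {/} and {/.} for a single input path. The `describe_path` helper and the example path are hypothetical, and the bash expansions are only close analogues for ordinary file names, not parallel's own implementation.

```shell
#!/usr/bin/env bash
# Bash analogues of GNU parallel's replacement strings for one input path.
describe_path() {
    local path=$1
    local base=${path##*/}          # strip the directory part, like {/}
    echo "{}   -> $path"            # the input as given
    echo "{.}  -> ${path%.*}"       # input without its extension
    echo "{/}  -> $base"            # basename only
    echo "{/.} -> ${base%.*}"       # basename without its extension
}

describe_path "path/to/your/files/cell_0001.fastq"   # hypothetical path
```

Before launching the real run, adding --dry-run to the parallel call prints each generated command without executing it, which is a quick way to confirm the paths come out as intended.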

PS: Watch out for the potential of using too much I/O (depends on the umi_tools design).

PPS: Based on your command, the logs could get weird, since every job writes to the same processed.log.