Question

Increase speed of Illumina read matepair finding bash

0

Entering edit mode

6.2 years ago

paulranum11 ▴ 80

Hi Everyone,

I am using a bash script to demultiplex and RNA-Seq reads generated with SPLiT-Seq (a single-cell RNA-Seq approach). There is a previous post describing this strategy here SPLiT-Seq Demultiplexing a bash tool for extraction of barcoded single cells.

This script demultiplexes reads tagged with successive rounds of indexing. One critical step in demultiplexing uses read IDs to find read mate pairs. This step works... but is much slower than needed for processing large ~18GB .fastq files containing thousands of cells.

Below piece of the script perform the matepair finding part of this script.

# Now we need to collect the other read pair. To do this we can collect read IDs from the results files we generated in step one.

# Generate an array of cell filenames
declare -a cells=( $(ls results/) )

# Parallelize mate pair finding
for cell in "${cells[@]}";
do 

# Grep for only the first word and add match to an array.
declare -a readID=( $(grep -Eo '^@[^ ]+' results/$cell) )


    grepfunction2() {
    grep -F -A 3 "$1 " $2 | sed '/^--/d'
    }
    export -f grepfunction2

    {
    parallel -j $numcores "grepfunction2 {} $FASTQ_F >> results/$cell.MATEPAIR" ::: "${readID[@]}" # Write the mate paired reads to a file
    } &> /dev/null
done

I am wondering if anyone can spot obvious flaws in this approach or has any suggestions for optimizing this approach for use on large .fastq files.

RNA-Seq sequencing • 841 views

ADD COMMENT • link 6.2 years ago by paulranum11 ▴ 80

1

Entering edit mode

The real answer to this is likely to be “don’t do it in bash”. Shell scripting isn’t famed for its speed.

ADD REPLY • link 6.2 years ago by Joe 21k