Increase speed of Illumina read matepair finding bash
0
0
Entering edit mode
6.2 years ago
paulranum11 ▴ 80

Hi Everyone,

I am using a bash script to demultiplex and RNA-Seq reads generated with SPLiT-Seq (a single-cell RNA-Seq approach). There is a previous post describing this strategy here SPLiT-Seq Demultiplexing a bash tool for extraction of barcoded single cells.

This script demultiplexes reads tagged with successive rounds of indexing. One critical step in demultiplexing uses read IDs to find read mate pairs. This step works... but is much slower than needed for processing large ~18GB .fastq files containing thousands of cells.

Below piece of the script perform the matepair finding part of this script.

# Now we need to collect the other read pair. To do this we can collect read IDs from the results files we generated in step one.

# Generate an array of cell filenames
declare -a cells=( $(ls results/) )

# Parallelize mate pair finding
for cell in "${cells[@]}";
do 

# Grep for only the first word and add match to an array.
declare -a readID=( $(grep -Eo '^@[^ ]+' results/$cell) )


    grepfunction2() {
    grep -F -A 3 "$1 " $2 | sed '/^--/d'
    }
    export -f grepfunction2

    {
    parallel -j $numcores "grepfunction2 {} $FASTQ_F >> results/$cell.MATEPAIR" ::: "${readID[@]}" # Write the mate paired reads to a file
    } &> /dev/null
done

I am wondering if anyone can spot obvious flaws in this approach or has any suggestions for optimizing this approach for use on large .fastq files.

RNA-Seq sequencing • 841 views
ADD COMMENT
1
Entering edit mode

The real answer to this is likely to be “don’t do it in bash”. Shell scripting isn’t famed for its speed.

ADD REPLY

Login before adding your answer.

Traffic: 1855 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6