Hi Everyone,
I am using a bash script to demultiplex and RNA-Seq reads generated with SPLiT-Seq (a single-cell RNA-Seq approach). There is a previous post describing this strategy here SPLiT-Seq Demultiplexing a bash tool for extraction of barcoded single cells.
This script demultiplexes reads tagged with successive rounds of indexing. One critical step in demultiplexing uses read IDs to find read mate pairs. This step works... but is much slower than needed for processing large ~18GB .fastq files containing thousands of cells.
Below piece of the script perform the matepair finding part of this script.
# Now we need to collect the other read pair. To do this we can collect read IDs from the results files we generated in step one.
# Generate an array of cell filenames
declare -a cells=( $(ls results/) )
# Parallelize mate pair finding
for cell in "${cells[@]}";
do
# Grep for only the first word and add match to an array.
declare -a readID=( $(grep -Eo '^@[^ ]+' results/$cell) )
grepfunction2() {
grep -F -A 3 "$1 " $2 | sed '/^--/d'
}
export -f grepfunction2
{
parallel -j $numcores "grepfunction2 {} $FASTQ_F >> results/$cell.MATEPAIR" ::: "${readID[@]}" # Write the mate paired reads to a file
} &> /dev/null
done
I am wondering if anyone can spot obvious flaws in this approach or has any suggestions for optimizing this approach for use on large .fastq files.
The real answer to this is likely to be “don’t do it in bash”. Shell scripting isn’t famed for its speed.