I want to build a standardized quality control process for particular RNA-seq samples, so I intend to downsample the alignment results (BAM files) at multiple proportions. I tried DownsampleSam from Picard tools, and my command was as follows.
for SAMPLE in sample1 sample2 sample3 sample4 sample5 sample6; do
    parallel --env SAMPLE --keep-order -j 8 '
        picard DownsampleSam \
            I=${SAMPLE}/raw.sam.gz \
            O=${SAMPLE}/{}downsampled.bam \
            A=1.0E-5 \
            P={}
    ' ::: 0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.5
done
This command does what I expect, but it seems time-consuming and computationally intensive.
Considering that hundreds of samples may be processed later, I wonder whether there is a more efficient solution. For example, when running P=0.5, is it possible to output the results for P=0.001, P=0.002, P=0.005, P=0.01, P=0.02, P=0.05, P=0.1, and P=0.2 together? That would save a lot of computing power and time!
Thanks in advance!
May I ask what downsampling has to do with QC? As for the parallelization, if you want to parallelize further, wrap this snippet itself in parallel rather than a loop, or submit an array of jobs if you are on a cluster that supports this, e.g. via SLURM.
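One way to read the suggestion about replacing the loop is to flatten both levels (samples and proportions) into a single list of independent jobs. The sketch below only prints the commands; `make_jobs` is a hypothetical helper name, and the sample names, paths, and Picard arguments are copied from the question. It does not reduce the total amount of work, it only makes every sample/proportion pair schedulable on its own.

```shell
# Sample names and fractions copied from the loop in the question.
samples="sample1 sample2 sample3 sample4 sample5 sample6"
fractions="0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.5"

# make_jobs (hypothetical helper): print one DownsampleSam command per
# sample/fraction pair, i.e. 6 x 8 = 48 independent command lines.
make_jobs() {
    for s in $samples; do
        for p in $fractions; do
            printf 'picard DownsampleSam I=%s/raw.sam.gz O=%s/%sdownsampled.bam A=1.0E-5 P=%s\n' \
                "$s" "$s" "$p" "$p"
        done
    done
}

# Inspect the list first, then hand it to a runner of your choice:
#   make_jobs | head -n 3
#   make_jobs | parallel -j 8    # GNU parallel runs each input line as a command
```

The same list could also be written to a file and used as the input of a cluster job array, with each array task running one line.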
I want to analyze the coverage of specially treated RNA by downsampling. For the for loop, I used bsub to submit the jobs to multiple compute nodes of hpcc, but it is still a bit wasteful. I think downsampling should be able to produce results at multiple scales in a single run.
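The idea of getting several scales from one read of the input can be sketched as follows. This is not how Picard's DownsampleSam works and not one of its options; it is a minimal stand-in using plain awk on SAM text. Each read name gets one random draw in [0, 1), and the record is written to every output whose fraction is larger than that draw, so the smaller subsets are nested inside the larger ones. Assumptions to note: `multi_downsample` is a hypothetical function name, the seed value 42 is arbitrary, the output naming is made up for illustration, the kept fractions are only approximate, and records sharing a read name (mates, secondary alignments) must be adjacent in the input, as is typical for unsorted aligner output.

```shell
# multi_downsample (hypothetical helper): split one SAM stream into several
# nested subsamples in a single pass.
# usage: multi_downsample OUT_PREFIX FRACTION [FRACTION...] < input.sam
# writes OUT_PREFIX.FRACTION.sam for each fraction given.
multi_downsample() {
    prefix=$1
    shift
    awk -F '\t' -v prefix="$prefix" -v fracs="$*" -v seed=42 '
        BEGIN {
            n = split(fracs, frac, " ")
            for (i = 1; i <= n; i++) out[i] = prefix "." frac[i] ".sam"
            srand(seed)               # fixed seed (arbitrary) for repeatable runs
        }
        /^@/ {                        # header lines go to every output
            for (i = 1; i <= n; i++) print > (out[i])
            next
        }
        {
            # One draw per read name; assumes records with the same name are
            # adjacent, so mates always share the same draw.
            if ($1 != last) { u = rand(); last = $1 }
            for (i = 1; i <= n; i++)
                if (u < frac[i] + 0) print > (out[i])
        }'
}
```

A possible invocation, with paths following the question's layout, would be `zcat sample1/raw.sam.gz | multi_downsample sample1/downsampled 0.001 0.002 0.005 0.01 0.02 0.05 0.1 0.5`; the resulting SAM files would then still need converting to BAM. For coordinate-sorted input the adjacency assumption does not hold, and a hash of the read name would have to replace the per-name random draw.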