The reason I have used gzip here (or maybe pigz to make it faster) is that my next step is mapping with BWA, which does not accept .bz2 files but does accept .gz files.
The code above works, but it takes around 30 minutes per file. If I skip the gzip step it takes about 22 minutes per file, but then the output files are very large. For 20 files this is going to take a long time, and in the future I will be receiving 40-45 files like this.
Can anyone please suggest a more efficient, less time-consuming alternative?
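For reference, a single-file conversion of the kind described above can be sketched roughly as follows; the trimmer name trim_tool, its options, the thread count, and the file names are placeholders, not the poster's actual command:

# decompress, trim, and recompress in one pipeline so no intermediate file is written;
# pigz -p 8 compresses with 8 threads, which is usually much faster than plain gzip
bzcat sample_R1.fastq.bz2 | trim_tool - | pigz -p 8 > sample_R1.trimmed.fastq.gz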
Looks to me like you're basically doing it right - you have to use bzcat and then gzip to convert from .bz2 to .gz files, and since bowtie takes .gz files, that seems to be the best way to go. Doing it in parallel on multiple files (see the other answer/comments) sounds like a good idea. In general, if you're dealing with a lot of deep-sequencing data, you have to expect the processing to take a while.
Is there any chance of asking your data provider to give you .gz instead of .bz2 files next time?
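If GNU parallel is available, the same conversion can also be run on several files at once with a one-liner along these lines (a sketch only; adjust -j to the number of cores and the glob to the real file names):

# convert every .fastq.bz2 in the directory to .fastq.gz, four files at a time;
# {.} is the input file name with its last extension (.bz2) stripped
parallel -j 4 'bzcat {} | gzip > {.}.gz' ::: *.fastq.bz2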
Thanks. I am doing the same thing now. I thought maybe there was some trick to trim reads inside a compressed file without creating a new compressed file.
Just in case, here's a sample script I use to execute my Perl script on all files in a particular folder simultaneously (note the & at the end of the perl command). The set command at the top is borrowed from websites/forums and I don't remember exactly what it is for, but it does the trick for me neatly!
#!/bin/bash
set -o errexit                     # exit immediately if any command fails

BAM_PATH="$1"                      # folder containing the input .bam files
OUT_PATH="$2"                      # folder for the output

for BAM in "$BAM_PATH"/*.bam
do
    if [ -f "$BAM" ]               # skip the unexpanded glob when no .bam files match
    then
        perl myscript.pl "$BAM" "$OUT_PATH" &    # one background job per file
    fi
done

trap "kill 0" SIGINT SIGTERM EXIT  # kill all background jobs if the script is interrupted
wait                               # block until every background job has finished
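Assuming the script above is saved as run_all.sh (the name is arbitrary), it is invoked with the input and output folders as arguments:

bash run_all.sh /path/to/bam_folder /path/to/output_folder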
Why don't you write a shell script to run the trimming on all fastq files in parallel, for example something like the sketch below? The effective wall-clock time is then about 30 minutes...
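A minimal sketch of that idea, assuming the inputs end in .fastq.bz2 and using a placeholder trim_tool for the trimming step (with many files you may want to cap the number of simultaneous jobs to the number of available cores):

#!/bin/bash
# launch one decompress-trim-recompress pipeline per file in the background
for FQ in *.fastq.bz2
do
    bzcat "$FQ" | trim_tool - | gzip > "${FQ%.fastq.bz2}.trimmed.fastq.gz" &
done
wait    # block until every background pipeline has finished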
Yes, you are right, I can do that. But is this the right way, or is there a more efficient solution? I have a bad feeling that I am doing something wrong by using bzcat and then compressing again with gzip to create a new file.
If your command (the one you have shown) works nicely, and you've got a cluster waiting for chunks of data to swallow and spit, then why not? :)