The reason I have used gzip here (or maybe pigz to make it faster) is that my next step is mapping with BWA, which does not accept .bz2 files but does accept .gz files.
The code above works, but it takes around 30 minutes per file. If I skip the gzip step it takes about 22 minutes per file, but then the output files are very large. For 20 files this is going to take a long time, and in the future I will be receiving 40-45 files like this.
Can anyone please suggest a more efficient, less time-consuming alternative?
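For reference, a single-file conversion of the kind described above can be sketched roughly as follows; the trimmer name trim_tool, its options, the thread count, and the file names are placeholders, not the poster's actual command:

# decompress, trim, and recompress in one pipeline so no intermediate file is written;
# pigz -p 8 compresses with 8 threads, which is usually much faster than plain gzip
bzcat sample_R1.fastq.bz2 | trim_tool - | pigz -p 8 > sample_R1.trimmed.fastq.gz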
Looks to me like you're basically doing it right - you have to use bzcat and then gzip to convert from .bz2 to .gz files, and since bowtie takes .gz files, that seems to be the best way to go. Doing it in parallel on multiple files (see the other answer/comments) sounds like a good idea. In general, if you're dealing with a lot of deep-sequencing data, you have to expect the processing to take a while.
Is there any chance of asking your data provider to give you .gz instead of .bz2 files next time?
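If GNU parallel is available, the same conversion can also be run on several files at once with a one-liner along these lines (a sketch only; adjust -j to the number of cores and the glob to the real file names):

# convert every .fastq.bz2 in the directory to .fastq.gz, four files at a time;
# {.} is the input file name with its last extension (.bz2) stripped
parallel -j 4 'bzcat {} | gzip > {.}.gz' ::: *.fastq.bz2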
Thanks. I am doing the same thing now. I thought maybe there was some trick to trim reads inside a compressed file without creating a new compressed file.
Just in case, here's a sample script I use to execute my Perl script on all files in a particular folder simultaneously (note the & at the end of the perl command). The set command at the top is borrowed from websites/forums and I don't remember exactly what it is for, but it does the trick for me neatly!
#!/bin/bash
set -o errexit                     # exit immediately if any command fails

BAM_PATH="$1"                      # folder containing the input .bam files
OUT_PATH="$2"                      # folder for the output

for BAM in "$BAM_PATH"/*.bam
do
    if [ -f "$BAM" ]               # skip the unexpanded glob when no .bam files match
    then
        perl myscript.pl "$BAM" "$OUT_PATH" &    # one background job per file
    fi
done

trap "kill 0" SIGINT SIGTERM EXIT  # kill all background jobs if the script is interrupted
wait                               # block until every background job has finished
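Assuming the script above is saved as run_all.sh (the name is arbitrary), it is invoked with the input and output folders as arguments:

bash run_all.sh /path/to/bam_folder /path/to/output_folder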
Why don't you write a shell script to run the trimming on all fastq files in parallel, for example something like the sketch below? The effective wall-clock time is then about 30 minutes...
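A minimal sketch of that idea, assuming the inputs end in .fastq.bz2 and using a placeholder trim_tool for the trimming step (with many files you may want to cap the number of simultaneous jobs to the number of available cores):

#!/bin/bash
# launch one decompress-trim-recompress pipeline per file in the background
for FQ in *.fastq.bz2
do
    bzcat "$FQ" | trim_tool - | gzip > "${FQ%.fastq.bz2}.trimmed.fastq.gz" &
done
wait    # block until every background pipeline has finished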
Yes, you are right, I can do that. But is this the right way, or is there a more efficient solution? I have a bad feeling that I am doing something wrong by using bzcat and then compressing again with gzip to create a new file.
If your command (the one you have shown) works nicely, and you've got a cluster waiting for chunks of data to swallow and spit, then why not? :)