Automating the merging of multiple fastq files into one fastq file
6.0 years ago
zhou_1228 • 0

I got six fastq files (three forward and three reverse) for every sample in a 96-well plate via NGS. As the first step of SNP calling, I need to merge these six files into two files (one forward and one reverse fastq file) for each sample. I am trying to write a shell script that automatically merges every three forward (or reverse) files into one for multiple samples. Below is the script I wrote to automatically convert one fastq file to one bwa file for many samples, but I need the script to merge three files into one. Thank you.

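# Align every fastq file in ~/NGS with bwa, one output file per input
for fq in ~/NGS/*.fastq
do
    echo "working with file $fq"

    # strip the directory and the .fastq extension
    base=$(basename "$fq" .fastq)
    echo "base name is $base"

    bwa=~/results/bwa/${base}.bwa

    bwa aln -t 4 GMbwaidx "$fq" > "$bwa"
done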

My six files for one sample look like this:

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq
143_P001_WB01_S13_L002_R1_001.fastq

142_P001_WB01_S1751_L008_R2_001.fastq
143_P001_WB01_S13_L001_R2_001.fastq
143_P001_WB01_S13_L002_R2_001.fastq
SNP sequencing • 2.5k views

Hello zhou_1228,

142_P001_WB01_S1751_L008_R1_001.fastq
143_P001_WB01_S13_L001_R1_001.fastq

How do you know that these files belong to the same sample? Which part of the filename gives that information?

fin swimmer


P001_WB01 represents plate 1, well B1.


convert one fastq file to one bwa file

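There is no bwa file format. bwa mem outputs alignments in the SAM format (bwa aln, as used in your script, writes an intermediate .sai file that bwa samse or bwa sampe converts to SAM). For SAM output, I would write: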

bwa=~/results/bwa/${base}.sam
6.0 years ago
Malcolm.Cook ★ 1.5k

Install and use GNU Parallel.

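Then use the following model. With --dry (short for --dry-run), parallel only prints the commands it would run, which are listed after the parallel line. Remove --dry when you're ready to run: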

nPlates=3
nWells=4
parallel -k --dry 'bwa aln -t 4 GMbwaidx <(cat NGS/*{1}_{2}*.fastq) > {1}_{2}.sam' :::: <( seq -f 'P%03g' ${nPlates} ) <(seq -f 'WB%02g' ${nWells} )
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB01*.fastq) > P001_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB02*.fastq) > P001_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB03*.fastq) > P001_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P001_WB04*.fastq) > P001_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB01*.fastq) > P002_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB02*.fastq) > P002_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB03*.fastq) > P002_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P002_WB04*.fastq) > P002_WB04.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB01*.fastq) > P003_WB01.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB02*.fastq) > P003_WB02.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB03*.fastq) > P003_WB03.sam
bwa aln -t 4 GMbwaidx <(cat NGS/*P003_WB04*.fastq) > P003_WB04.sam

Notes:

The above approach

  • depends upon bwa's ability to stream input
  • works with any number of fastq files per plate_well combination.
  • does not need parallel's ability to run multiple jobs, since bwa already uses 4 threads via -t 4; parallel starts one job per CPU core by default, so add -j 1 to run the jobs one at a time
  • assumes your shell is bash, and depends upon its capability for Process Substitution
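
The plate and well lists above are generated with seq, so they can include combinations for which no files exist. As a minimal sketch, assuming the plate_well key always looks like P001_WB01 (P, three digits, _W, a row letter A-H, two digits), the keys actually present could be read from the filenames instead. The name list_samples is a hypothetical helper, not part of bwa or GNU Parallel.

```shell
# list_samples DIR: print each distinct plate_well key (e.g. P001_WB01)
# found among the *.fastq filenames in DIR, one per line.
# Hypothetical helper; the key pattern is an assumption based on the
# filenames shown in the question.
list_samples() {
    for f in "$1"/*.fastq; do
        [ -e "$f" ] || continue   # skip the unexpanded glob if DIR has no fastq files
        basename "$f"
    done |
        sed -n 's/.*\(P[0-9][0-9][0-9]_W[A-H][0-9][0-9]\).*/\1/p' |
        sort -u
}
```

Its output could then replace the seq-generated input sources, for example: list_samples NGS | parallel -k --dry-run 'bwa aln -t 4 GMbwaidx <(cat NGS/*{}*.fastq) > {}.sam'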

From my understanding, you run two commands, cat and bwa, together in your model. But in my case, for each sample, I first need to merge the three R1.fastq files into one -F.fastq and the three R2.fastq files into one -R.fastq, separately, and then run the command "bwa mem GMbwaidx -F.fastq -R.fastq > *.sam" to generate the sam file. Do you have any suggestion for running these two steps automatically?


Sure. The approach is the same; you just need two calls to cat, using slightly different file wildcarding (aka globbing) in each. Also, I now realize your well identifier has a row and a column component. In this updated example, for brevity, I limit it to the first three rows, A through C, and the first two zero-padded columns, 01 and 02:

plate=$(seq -f 'P%03g' 3)
row=$(echo {A..C})
col=$(seq -f '%02g' 2 )
parallel -k --dry 'bwa mem GMbwaidx <(cat NGS/*{1}_W{2}{3}*_R1_*.fastq) <(cat NGS/*{1}_W{2}{3}*_R2_*.fastq) > {1}_W{2}{3}.sam' ::: $plate ::: $row ::: $col
bwa mem GMbwaidx <(cat NGS/*P001_WA01*_R1_*.fastq) <(cat NGS/*P001_WA01*_R2_*.fastq) > P001_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WA02*_R1_*.fastq) <(cat NGS/*P001_WA02*_R2_*.fastq) > P001_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB01*_R1_*.fastq) <(cat NGS/*P001_WB01*_R2_*.fastq) > P001_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WB02*_R1_*.fastq) <(cat NGS/*P001_WB02*_R2_*.fastq) > P001_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC01*_R1_*.fastq) <(cat NGS/*P001_WC01*_R2_*.fastq) > P001_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P001_WC02*_R1_*.fastq) <(cat NGS/*P001_WC02*_R2_*.fastq) > P001_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA01*_R1_*.fastq) <(cat NGS/*P002_WA01*_R2_*.fastq) > P002_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WA02*_R1_*.fastq) <(cat NGS/*P002_WA02*_R2_*.fastq) > P002_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB01*_R1_*.fastq) <(cat NGS/*P002_WB01*_R2_*.fastq) > P002_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WB02*_R1_*.fastq) <(cat NGS/*P002_WB02*_R2_*.fastq) > P002_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC01*_R1_*.fastq) <(cat NGS/*P002_WC01*_R2_*.fastq) > P002_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P002_WC02*_R1_*.fastq) <(cat NGS/*P002_WC02*_R2_*.fastq) > P002_WC02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA01*_R1_*.fastq) <(cat NGS/*P003_WA01*_R2_*.fastq) > P003_WA01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WA02*_R1_*.fastq) <(cat NGS/*P003_WA02*_R2_*.fastq) > P003_WA02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB01*_R1_*.fastq) <(cat NGS/*P003_WB01*_R2_*.fastq) > P003_WB01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WB02*_R1_*.fastq) <(cat NGS/*P003_WB02*_R2_*.fastq) > P003_WB02.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC01*_R1_*.fastq) <(cat NGS/*P003_WC01*_R2_*.fastq) > P003_WC01.sam
bwa mem GMbwaidx <(cat NGS/*P003_WC02*_R1_*.fastq) <(cat NGS/*P003_WC02*_R2_*.fastq) > P003_WC02.sam
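
If merged files on disk are preferred over streaming, as asked in the comment above, the same globs can drive two plain cat calls per sample. This is only a sketch, assuming the filename layout shown in the question; merge_reads and count_reads are hypothetical helper names, and the -F.fastq / -R.fastq suffixes follow the naming used in that comment.

```shell
# merge_reads DIR KEY OUTDIR: concatenate all R1 files for one plate_well KEY
# into OUTDIR/KEY-F.fastq and all R2 files into OUTDIR/KEY-R.fastq.
# Hypothetical helper; assumes names like 143_P001_WB01_S13_L001_R1_001.fastq.
merge_reads() {
    # Globs expand in sorted order, so as long as paired files differ only
    # in R1/R2, both outputs are concatenated in the same file order.
    cat "$1"/*"$2"*_R1_*.fastq > "$3/$2-F.fastq" &&
    cat "$1"/*"$2"*_R2_*.fastq > "$3/$2-R.fastq"
}

# count_reads FILE: number of reads, assuming 4 lines per fastq record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}
```

After merging, count_reads should print the same number for the -F and -R files of a sample; the merged pair can then be passed to bwa mem as two ordinary file arguments.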

As rewritten, the approach


Thank you so much for your reply. I found that there are many GNU Parallel packages available for download. My OS is Linux Mint 18.1, so which one should I download? Thank you.


I cannot help much beyond suggesting that you install the latest version of GNU Parallel that is packaged for your operating system distribution.

Probably install with:

sudo apt-get install parallel

but it is best to follow your OS documentation, possibly such as: Installing softwares

or, for hints, https://www.gnu.org/software/parallel/


I got it. Thank you so much.


Great - glad to help - please upvote and accept the answer!
