Question

Merging two fastq.gz files

0

6.6 years ago

tcf.hcdg ▴ 70

Hello,

I have 96 *.fastq.gz raw read files from 24 samples. Each sample was sequenced paired-end on two lanes, giving four files per sample (R1 and R2 for each lane).

I would like to merge the reads from both lanes for each read type into one output file per sample, named after the sample identifier in the file name (e.g. 2271_merged_R1_001.fastq.gz).

File names are in this order:
22[71-94]*R[1-2]_001.fastq.gz;

**2271**_ID890_1_S1_L001_**R1_001.fastq.gz**
**2271**_ID890_1_S1_L002_**R1_001.fastq.gz**

**2271**_ID890_1_S1_L001_**R2_001.fastq.gz**
**2271**_ID890_1_S1_L002_**R2_001.fastq.gz**

I tried the following short script, but only two output files are generated (the first and the last).

For R1 files:

  for rf in 22[71-94]*R1_001.fastq.gz; do zcat $rf > 22"${71-94}"_merged_R1_001.fastq.gz ; done

For R2 files:

for rf in 22[71-94]*R2_001.fastq.gz; do zcat $rf > 22"${71-94}"_merged_R2_001.fastq.gz ; done

My questions are:

1. Why are only two output files generated?
2. Why is the number of reads in the output files not the sum of the reads in the files from both lanes?
3. Is there a nice way to merge the reads from both lanes for both read types (R1 and R2) in a single step, instead of running the loop once per read type?

What went wrong in the code, and how can I verify that the output files are completely merged?

Thanks

fastq merging • 6.4k views

updated 6.6 years ago by igor 13k • written 6.6 years ago by tcf.hcdg ▴ 70

0

For the 48 R1 files, the following code will work (back up your work and try it on 1-2 sets before using it; match MD5sums):

for i in *1_R1_001.fastq.gz; do zcat "${i%%01*}01_R1_001.fastq.gz" "${i%%01*}02_R1_001.fastq.gz" | gzip -c > "${i%%_*}_merged_R${i#*_R*}"; done

The same approach works for R2 once R1 is replaced with R2 in the glob and the file names. For sample 2271 R1, the output file name would be 2271_merged_R1_001.fastq.gz.

written 6.6 years ago by cpad0112 21k
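
To see what the parameter expansions in that one-liner produce, here is a small sketch that applies them to one of the file names from the question. It only prints strings and touches no files. The second file name (with ID901) is a made-up example to show a caveat: `${i%%01*}` cuts at the first "01" anywhere in the name, so an ID containing "01" would break the lane substitution.

```shell
# File name taken from the question.
i=2271_ID890_1_S1_L001_R1_001.fastq.gz

prefix=${i%%01*}   # drop the longest suffix starting at the first "01"
sample=${i%%_*}    # drop everything from the first "_" onwards
tail=${i#*_R*}     # drop the shortest prefix ending in "_R"

echo "lane 1 input : ${prefix}01_R1_001.fastq.gz"
echo "lane 2 input : ${prefix}02_R1_001.fastq.gz"
echo "output       : ${sample}_merged_R${tail}"

# Hypothetical name with "01" inside the ID: the cut lands too early.
j=2271_ID901_1_S1_L001_R1_001.fastq.gz
bad_prefix=${j%%01*}
echo "caveat prefix: ${bad_prefix}"
```

If your sample IDs can contain "01", it is probably safer to build the lane names from the `_L001_` / `_L002_` field explicitly.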

score 1 · Answer 1 · 2018-04-25

1

6.6 years ago

Pierre Lindenbaum 164k

No need to use gzcat; just use cat. See: merge large amount of fastq files into a single one

written 6.6 years ago by Pierre Lindenbaum 164k
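
Plain cat works because a gzip file may consist of several compressed members back to back, and gzip decompresses them as one stream. A minimal sketch to convince yourself, using two tiny mock FASTQ records in a scratch directory (the lane1/lane2 file names and the records are made up for illustration):

```shell
# Work in a scratch directory so no real data is touched.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip -c > "$tmp/lane1.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp/lane2.fastq.gz"

# Plain cat: the result is a multi-member gzip file.
cat "$tmp/lane1.fastq.gz" "$tmp/lane2.fastq.gz" > "$tmp/merged.fastq.gz"

# Decompressing the merged file yields both records, in order.
merged_lines=$(gzip -dc "$tmp/merged.fastq.gz" | wc -l | tr -d ' ')
first_header=$(gzip -dc "$tmp/merged.fastq.gz" | sed -n '1p')
fifth_line=$(gzip -dc "$tmp/merged.fastq.gz" | sed -n '5p')
echo "lines=$merged_lines first=$first_header fifth=$fifth_line"

rm -rf "$tmp"
```

This avoids decompressing and recompressing every file, which matters with 96 inputs.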

score 0 · Answer 2 · 2018-04-25

0

6.6 years ago

yhoogstrate ▴ 150

Is this what you're looking for maybe?:

for rf in 22[71-94]*R1_001.fastq.gz; do cat $rf >> 22"${71-94}"_merged_R1_001.fastq.gz ; done

zcat decompresses, which is unnecessary here (and it leaves uncompressed data in a file named .gz). Also, >> appends while > overwrites; appending seems to be what you need.

I hope this helps you a bit.

Enjoy,

Youri

written 6.6 years ago by yhoogstrate ▴ 150

0

And what about "1. Why only two output files are generated?"

written 6.6 years ago by tcf.hcdg ▴ 70

1

I used the following and it worked:

R1

for ((num=71; num<=94; num++)); { cat 22"$num"*{L001,L002}_R1_001.fastq.gz > "22${num}_merged_R1_001.fastq.gz" ;}

R2

for ((num=71; num<=94; num++)); { cat 22"$num"*{L001,L002}_R2_001.fastq.gz > "22${num}_merged_R2_001.fastq.gz" ;}

written 6.6 years ago by tcf.hcdg ▴ 70
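
For questions 2 and 3 (one pass for both read types, plus a check of the result), something along these lines might work. This is only a sketch: it builds two mock samples in a scratch directory with the naming scheme from the question (the ID890_1_S1 part is reused for both and is not real), and it assumes standard 4-line FASTQ records, so reads = lines / 4.

```shell
# Mock data: samples 2271 and 2272, two lanes, two read types, one read each.
tmp=$(mktemp -d)
for num in 71 72; do
  for lane in L001 L002; do
    for read in R1 R2; do
      printf '@%s_%s_%s\nACGT\n+\nIIII\n' "$num" "$lane" "$read" |
        gzip -c > "$tmp/22${num}_ID890_1_S1_${lane}_${read}_001.fastq.gz"
    done
  done
done

all_ok=yes
for num in 71 72; do        # real data: for ((num=71; num<=94; num++)); do
  for read in R1 R2; do     # both read types in the same pass
    out="$tmp/22${num}_merged_${read}_001.fastq.gz"
    in_reads=0
    : > "$out"              # start from an empty output file
    # The glob expands in sorted order, so L001 is appended before L002.
    for f in "$tmp"/22"${num}"_*_L00[12]_"${read}"_001.fastq.gz; do
      cat "$f" >> "$out"
      in_reads=$(( in_reads + $(gzip -dc "$f" | wc -l) / 4 ))
    done
    # Verify: reads in the merged file must equal the sum over both lanes.
    out_reads=$(( $(gzip -dc "$out" | wc -l) / 4 ))
    if [ "$in_reads" -eq "$out_reads" ]; then
      echo "22${num} ${read}: OK ($out_reads reads)"
    else
      echo "22${num} ${read}: MISMATCH (in=$in_reads out=$out_reads)"
      all_ok=no
    fi
  done
done
```

Keeping the lane order identical for R1 and R2 matters, because paired-end tools expect read N in R1 to match read N in R2.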

score 0 · Answer 3 · 2018-04-25

If you are not sure what your code is doing, try checking what is actually happening. Instead of generating the final file blindly and hoping it is working properly, print the progress. For example, you can check which inputs are getting paired with which outputs:

for rf in 22[71-94]*R1_001.fastq.gz; do
  echo "$rf  to  22${71-94}_merged_R1_001.fastq.gz"
done
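
Running a check like this in bash shows two things about the original loop. `[71-94]` is a bracket expression that matches ONE character ('7', anything from '1' to '9', or '4'), not the numbers 71 to 94. And `${71-94}` is a parameter expansion meaning "positional parameter 71, or the default 94 if it is unset", so the output name is the same on every iteration and `>` overwrites it each time. A small sketch with empty mock files (the `_x_` names are made up):

```shell
tmp=$(mktemp -d)
touch "$tmp/2271_x_R1_001.fastq.gz" "$tmp/2283_x_R1_001.fastq.gz" \
      "$tmp/2294_x_R1_001.fastq.gz" "$tmp/2210_x_R1_001.fastq.gz" \
      "$tmp/2205_x_R1_001.fastq.gz"

# The glob matches 2210 (third character '1') although 10 is not in 71..94,
# and skips 2205 only because its third character is '0'.
matched=$(cd "$tmp" && printf '%s\n' 22[71-94]*R1_001.fastq.gz | wc -l | tr -d ' ')

# With no 71st positional parameter set, this always expands to 94.
out_name="22${71-94}_merged_R1_001.fastq.gz"
echo "matched=$matched out=$out_name"

# Deriving the sample ID from each file name gives one output name per sample.
for rf in "$tmp"/22*_R1_001.fastq.gz; do
  base=${rf##*/}                       # strip the directory part
  echo "$base  to  ${base%%_*}_merged_R1_001.fastq.gz"
done
derived=${base%%_*}_merged_R1_001.fastq.gz   # name derived for the last file

rm -rf "$tmp"
```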