merging .fastq.gz files results in corrupted files
0
0
Entering edit mode
4 weeks ago
rDNA ▴ 20

Hello,

I've downloaded paired-end Illumina sequencing files (.fq.gz) from a repository. Some samples were sequenced in multiple runs, so I want to merge the forward reads of all runs per sample, and likewise the reverse reads. In the end I want one forward and one reverse read file per sample.

For example, for sample 1 I have the files:

  • 1_HJTTNDRXX.1.fq.gz #fw run 1
  • 1_HJTTNDRXX.2.fq.gz #rev run 1
  • 1_HJVG7DRXX.1.fq.gz #fw run 2
  • 1_HJVG7DRXX.2.fq.gz #rev run 2

To merge the files per sample, I use the following script:

for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do #the ID before the first "_" is the sample name
  echo "Processing sample: $sample"

  # Merge .1 files for the sample = forward reads
  cat ${sample}_*.1.fq.gz > ./../merged_data_16S/${sample}.R1.fq.gz #the * depicts the run ID

  # Merge .2 files for the sample = reverse reads
  cat ${sample}_*.2.fq.gz > ./../merged_data_16S/${sample}.R2.fq.gz
done

For sample 1 this yields 1.R1.fq.gz (merged forward reads) and 1.R2.fq.gz (merged reverse reads).

However, after merging, some files turn out to be corrupted:

  1. I get a BadZipFile error when importing into QIIME2.
  2. When I run for file in *.fq.gz; do gzip -t "$file"; done, for some files I get gzip: 190.R2.fq.gz: invalid compressed data--crc error or gzip: 242.R2.fq.gz: invalid compressed data--length error.

When I run this for loop on the original, unmerged files, no errors occur, which suggests the corruption is introduced by the merging.

Re-merging seems to corrupt a different set of files each time, which also suggests the original files are OK.

Why does this corruption happen, and how can I merge these files correctly?

corrupt gzip • 432 views
ADD COMMENT
1
Entering edit mode

As QIIME2 checks file integrity by comparing the recorded checksum with the data, the cat approach may not work. You can try zcat ${sample}_*.1.fq.gz | gzip -c > ./../merged_data_16S/${sample}.R1.fq.gz instead.
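A minimal sketch of the whole loop with that change, assuming the same file layout and output directory as in the question (untested):

for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do
  echo "Processing sample: $sample"
  # decompress all runs, then re-compress into a single clean gzip stream
  zcat "${sample}"_*.1.fq.gz | gzip -c > ./../merged_data_16S/"${sample}".R1.fq.gz
  zcat "${sample}"_*.2.fq.gz | gzip -c > ./../merged_data_16S/"${sample}".R2.fq.gz
done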

ADD REPLY
2
Entering edit mode

Another idea: before merging, check each downloaded file, decompress it with gunzip, and re-compress it with pigz if all is OK. A rough sketch is shown below.
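This sketch assumes pigz is installed and that there is enough disk space for the temporarily decompressed files (untested):

for f in *.fq.gz; do
  # -t tests the archive without writing any output
  if gzip -t "$f"; then
    gunzip "$f"          # produces e.g. 1_HJTTNDRXX.1.fq
    pigz "${f%.gz}"      # re-compress, restoring 1_HJTTNDRXX.1.fq.gz
  else
    echo "Corrupt file: $f" >&2
  fi
done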

ADD REPLY
1
Entering edit mode

gzip, not gunzip -c.

ADD REPLY
0
Entering edit mode

Sorry, my bad. I have corrected it.

ADD REPLY
0
Entering edit mode

Thanks for the reply. I also tried zcat ${sample}_*.1.fq.gz | gzip > ./../merged_data_16S/${sample}.R1.fq.gz but this gave the same issue. However, I did not run gzip with the -c parameter ("write on standard output, keep original files unchanged"), so perhaps option -c is necessary here? I'll try again. I also re-downloaded the data (43 gigabytes). Before, I just clicked a download link and let the browser fetch it; now I re-download it with wget, which is probably more stable.
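For what it's worth, a small sketch of a re-download plus integrity check, with a placeholder URL since the actual repository link isn't given in the thread:

# -c resumes a partial download instead of starting over
wget -c "https://example.org/data/1_HJTTNDRXX.1.fq.gz"   # placeholder URL
# test every archive before merging anything
for f in *.fq.gz; do gzip -t "$f" || echo "bad: $f"; done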

ADD REPLY
0
Entering edit mode

gunzip -c will not generate a GZ file.

First, as Arup Ghosh said, test your original files:

gunzip -t /path/to/*.fq.gz 
ADD REPLY
0
Entering edit mode

@Pierre, I tested my original files using gzip -t file.fq.gz. That behaves the same as what you propose, gunzip -t file.fq.gz, right? I know gzip compresses and gunzip decompresses, but I mean the behavior is the same when testing integrity with the -t flag.

ADD REPLY
