merging .fastq.gz files results in corrupted files
0
0
Entering edit mode
4 weeks ago
rDNA ▴ 20

Hello,

I've downloaded paired-end Illumina sequencing files (.fq.gz) from a repository. Some samples were sequenced in multiple runs, so I want to merge the forward reads of all runs per sample, and likewise the reverse reads. In the end I want one forward and one reverse read file per sample.

For example, for sample 1 I have the files:

  • 1_HJTTNDRXX.1.fq.gz #fw run 1
  • 1_HJTTNDRXX.2.fq.gz #rev run 1
  • 1_HJVG7DRXX.1.fq.gz #fw run 2
  • 1_HJVG7DRXX.2.fq.gz #rev run 2

To merge the files per sample, I use the following script:

for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do #the ID before the first "_" is the sample name
  echo "Processing sample: $sample"

  # Merge .1 files for the sample = forward reads
  cat ${sample}_*.1.fq.gz > ./../merged_data_16S/${sample}.R1.fq.gz #the * depicts the run ID

  # Merge .2 files for the sample = reverse reads
  cat ${sample}_*.2.fq.gz > ./../merged_data_16S/${sample}.R2.fq.gz
done

For sample 1 this yields 1.R1.fq.gz (merged forward reads) and 1.R2.fq.gz (merged reverse reads).

However, after merging, some files turn out to be corrupted:

  1. I get a BadZipFile error when importing into QIIME2.
  2. When I run for file in *.fq.gz; do gzip -t "$file"; done, for some files I get gzip: 190.R2.fq.gz: invalid compressed data--crc error or gzip: 242.R2.fq.gz: invalid compressed data--length error.

When I run this for loop on the original, unmerged files, no errors occur, which suggests the corruption is introduced by the merging.

Re-merging seems to corrupt a different set of files each time, which also suggests the original files are OK.

Why does this corruption happen, and how can I merge these files correctly?

corrupt gzip • 432 views
ADD COMMENT
1
Entering edit mode

As QIIME2 checks file integrity by comparing the recorded checksum with the data, the cat approach may not work. You can try zcat ${sample}_*.1.fq.gz | gzip -c > ./../merged_data_16S/${sample}.R1.fq.gz instead.
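A minimal sketch of the whole loop with that change, assuming the same file layout and output directory as in the question (untested):

for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do
  echo "Processing sample: $sample"
  # decompress all runs, then re-compress into a single clean gzip stream
  zcat "${sample}"_*.1.fq.gz | gzip -c > ./../merged_data_16S/"${sample}".R1.fq.gz
  zcat "${sample}"_*.2.fq.gz | gzip -c > ./../merged_data_16S/"${sample}".R2.fq.gz
done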

ADD REPLY
2
Entering edit mode

Another idea: before merging, check each downloaded file, decompress it with gunzip, and re-compress it with pigz if all is OK. A rough sketch is shown below.
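This sketch assumes pigz is installed and that there is enough disk space for the temporarily decompressed files (untested):

for f in *.fq.gz; do
  # -t tests the archive without writing any output
  if gzip -t "$f"; then
    gunzip "$f"          # produces e.g. 1_HJTTNDRXX.1.fq
    pigz "${f%.gz}"      # re-compress, restoring 1_HJTTNDRXX.1.fq.gz
  else
    echo "Corrupt file: $f" >&2
  fi
done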

ADD REPLY
1
Entering edit mode

gzip, not gunzip -c.

ADD REPLY
0
Entering edit mode

Sorry, my bad. I have corrected it.

ADD REPLY
0
Entering edit mode

Thanks for the reply. I also tried zcat ${sample}_*.1.fq.gz | gzip > ./../merged_data_16S/${sample}.R1.fq.gz but this gave the same issue. However, I did not run gzip with the -c parameter ("write on standard output, keep original files unchanged"), so perhaps option -c is necessary here? I'll try again. I also re-downloaded the data (43 gigabytes). Before, I just clicked a download link and let the browser fetch it; now I re-download it with wget, which is probably more stable.
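For what it's worth, a small sketch of a re-download plus integrity check, with a placeholder URL since the actual repository link isn't given in the thread:

# -c resumes a partial download instead of starting over
wget -c "https://example.org/data/1_HJTTNDRXX.1.fq.gz"   # placeholder URL
# test every archive before merging anything
for f in *.fq.gz; do gzip -t "$f" || echo "bad: $f"; done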

ADD REPLY
0
Entering edit mode

gunzip -c will not generate a GZ file.

First, as Arup Ghosh said, test your original files:

gunzip -t /path/to/*.fq.gz 
ADD REPLY
0
Entering edit mode

@Pierre, I tested my original files using gzip -t file.fq.gz. That behaves the same as what you propose, gunzip -t file.fq.gz, right? I know gzip compresses and gunzip decompresses, but I mean the behavior is the same when testing integrity with the -t flag.

ADD REPLY
