Hello,
I've downloaded paired-end Illumina sequencing files (.fq.gz) from a repository. Some samples were sequenced in multiple runs. I therefore want to merge the forward reads of all runs per sample, and likewise the reverse reads of all runs per sample, so that I end up with one forward and one reverse read file per sample.
For example, for sample 1 I have these files:
1_HJTTNDRXX.1.fq.gz   #fw run 1
1_HJTTNDRXX.2.fq.gz   #rev run 1
1_HJVG7DRXX.1.fq.gz   #fw run 2
1_HJVG7DRXX.2.fq.gz   #rev run 2
To merge the runs per sample, I use the following script:
for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do  #the ID before the first "_" is the sample name
    echo "Processing sample: $sample"
    # Merge .1 files for the sample = forward reads
    cat ${sample}_*.1.fq.gz > ./../merged_data_16S/${sample}.R1.fq.gz  #the * depicts the run ID
    # Merge .2 files for the sample = reverse reads
    cat ${sample}_*.2.fq.gz > ./../merged_data_16S/${sample}.R2.fq.gz
done
For sample 1, for example, this would yield 1.R1.fq.gz (merged forward reads of sample 1) and 1.R2.fq.gz (merged reverse reads of sample 1).
However, after merging, some files turn out to be corrupted:
- I get a BadZipFile error when importing into QIIME2.
- When I run for file in *.fq.gz; do gzip -t $file; done, for some files I get gzip: 190.R2.fq.gz: invalid compressed data--crc error or gzip: 242.R2.fq.gz: invalid compressed data--length error.
When I run the same loop on the original, unmerged files, no error occurs, which makes me think the corruption is introduced by the merging.
Re-merging also seems to result in different files being corrupted each time, which again suggests the original files are OK.
Why does this corruption happen, and how can I merge these files correctly?
As QIIME2 checks file integrity by comparing the recorded checksum with the data, the cat command may not work. You can try
zcat ${sample}_*.1.fq.gz | gzip -c > ./../merged_data_16S/${sample}.R1.fq.gz
instead.
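Applied to the loop from the question, the whole merge would then look roughly like this (same naming scheme and output directory as above; the streams are decompressed and re-gzipped once instead of concatenating the gzip members):
for sample in $(ls *.fq.gz | awk -F'_' '{print $1}' | sort | uniq); do
    echo "Processing sample: $sample"
    # decompress all runs of this sample and write one freshly gzipped file
    zcat ${sample}_*.1.fq.gz | gzip -c > ./../merged_data_16S/${sample}.R1.fq.gz   # forward reads
    zcat ${sample}_*.2.fq.gz | gzip -c > ./../merged_data_16S/${sample}.R2.fq.gz   # reverse reads
done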
Another idea would be to check the individual downloaded files before merging with gunzip, and then re-gzip with pigz if all is OK.
gzip, not gunzip -c.
Sorry, my bad. I have corrected it.
Thanks for the reply. I also used
zcat ${sample}_*.1.fq.gz | gzip > ./../merged_data_16S/${sample}.R1.fq.gz
but this gave the same issue. However, I did not run gzip with the parameter -c ("write on standard output, keep original files unchanged"), so perhaps option -c is necessary here? I'll try again. I also redownloaded the data (43 gigabytes). Before, I just clicked on a download link and let the browser download it; now I redownload it with wget, which is probably more stable.
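If the repository also provides checksums for the raw files, I could verify the fresh download before merging; a minimal sketch, assuming a checksum list named md5sums.txt (that filename is just a placeholder for whatever the repository ships):
md5sum -c md5sums.txt   # verifies every downloaded .fq.gz listed in the (hypothetical) md5sums.txt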
gunzip -c will not generate a GZ file.
First, as Arup Ghosh said, test your original files:
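gunzip -t file.fq.gz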
@Pierre, I tested my original files using gzip -t file.fq.gz. That behaves similarly to what you propose, gunzip -t file.fq.gz, right? I know gzip is for compressing and gunzip for decompressing, but what I mean is that both test file integrity in the same way with the -t flag.
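As I understand it, these two invocations run the same check (gunzip is equivalent to gzip -d, and -t only tests the compressed data in both cases):
gzip -t file.fq.gz     # prints nothing and exits 0 if the archive is intact
gunzip -t file.fq.gz   # same integrity test, invoked via gunzip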