Question

Synchronization of fastq files

0

9.7 years ago

Antonio R. Franco ★ 5.2k

I downloaded paired-end Illumina reads from the NCBI SRA and ran fastq-dump --split-3 to get a legacy extraction of the corresponding fastq files.

I ended up with three files: file_1.fastq.gz, file_2.fastq.gz and a third, file.fastq.gz. The third one corresponds to 492919 reads whose readlen < 1.

Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished. A test that extracts and compares the order of the read names and coordinates would take even longer.

So I would rather ask here about previous experiences.

Should I understand that name_1.fastq and name_2.fastq are synchronized files? That is, are the left and right reads in the same order? I ask this because the size difference between the two files (the _1 and the _2) is notable.
Is there any script that would allow me to synchronize these two files in case I need to?

Assembly velvet • 4.2k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.7 years ago by Antonio R. Franco ★ 5.2k

0

Answering my own question:

At least both files, file_1.fastq.gz and file_2.fastq.gz, have the same number of lines.

ADD REPLY • link 9.7 years ago by Antonio R. Franco ★ 5.2k

score 1 · Answer 1 · 2015-11-14

I've never seen an SRA file where the reads were out of sync, though I suppose it could in theory happen. There's a convenient tool from BBTools (reformat.sh, I think) to resync things should you ever need to do so (note, I wouldn't bother checking the results of fastq-dump unless you get obviously weird results from mapping/assembly).
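If you do want to check, a minimal sketch is to compare the read names of the two files record by record. This assumes four-line FASTQ records, that the name is the first whitespace-delimited field of each header line, and that mates differ at most by a trailing /1 or /2. The function names names and check_sync are just made up for this example.

```shell
# Print the read name (first field of every 4th line, starting at line 1),
# with an optional trailing /1 or /2 removed.
names() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { n = $1; sub(/\/[12]$/, "", n); print n }'
}

# Compare the name lists of two gzipped FASTQ files.
# Prints "in sync" (status 0) or "out of sync" (status 1).
check_sync() {
    tmp=$(mktemp -d) || return 2
    names "$1" > "$tmp/n1"
    names "$2" > "$tmp/n2"
    if cmp -s "$tmp/n1" "$tmp/n2"; then
        status=0
        echo "in sync"
    else
        status=1
        echo "out of sync"
    fi
    rm -rf "$tmp"
    return $status
}
```

You would call it as check_sync file_1.fastq.gz file_2.fastq.gz. It only reports whether the order matches; it does not repair anything, and it writes the two name lists to a temporary directory, so make sure there is space for them.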

Ram · Answer 2 · 2015-11-14

0

9.7 years ago

piet ★ 1.9k

I ask this because the size difference between the two files (the _1 and the _2) is notable

The difference in size usually results from gzip. If all residues in a read have exactly the same quality, gzip compresses more efficiently than if the quality values are spread over a large range. It is better to compare the sizes of the uncompressed files.
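One way to get the uncompressed sizes without writing anything to disk is to stream each file through wc -c. This is only a sketch, and uncompressed_bytes is a name made up for this example.

```shell
# Print the uncompressed size in bytes of a gzip-compressed file.
# The data is streamed through wc, so no uncompressed copy is stored.
uncompressed_bytes() {
    gzip -dc "$1" | wc -c
}
```

For example, uncompressed_bytes file_1.fastq.gz and uncompressed_bytes file_2.fastq.gz. Streaming is slower than reading the size recorded in the gzip trailer, but that field is stored modulo 2^32 and can be misleading for very large files.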

Sizes of these fastq.gz files are huge. A simple counting of lines takes too long to be accomplished.

You can use the wc command to count the number of lines of your fastq files:

zcat file_1.fastq.gz | wc
zcat file_2.fastq.gz | wc

The fastq-dump program emits a variant of the FASTQ format in which each read occupies exactly four lines.
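Given four lines per read, the number of reads is the line count divided by four. A minimal sketch follows; count_reads is a name made up for this example, and gzip -dc is used in place of zcat only because it behaves the same on more systems.

```shell
# Print the number of reads in a gzipped FASTQ file,
# assuming every record is exactly four lines long.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}
```

Running count_reads file_1.fastq.gz and count_reads file_2.fastq.gz should print the same number if the files are properly paired. Equal counts are necessary but do not prove that the reads are in the same order.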

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.7 years ago by piet ★ 1.9k

2

I think you'll want wc -l to count lines.

ADD REPLY • link 9.7 years ago by h.mon 35k

0

No, I recommend looking at the number of lines, the number of words, and the number of characters: all three numbers at once.

ADD REPLY • link 9.7 years ago by piet ★ 1.9k