Question

Downsampling paired-end FASTQ files with a precise number of bases (not reads) - How to do it properly ?

0

Entering edit mode

4.5 years ago

mikizu • 0

Hi !

Firstly, I am only a Bioinformatics student, I am sorry if what I am saying isn't clear. I was asked to downsample one FASTQ file (File 1) to the same number of reads than another sample's FASTQ file (File 2) (already done), and after that to downsample this FASTQ File 1 to the same number of bases than the FASTQ File 2. These files are from Illumina paired-end DNA sequencing runs on a NextSeq 550.

For the context, these two samples are from the same yeast strain but are from 2 different sequencing runs using 2 different library preparation kits. These data will be used to do a draft de novo assembly in order to compare which file, at equal coverage, generates the best assembly (and then which library preparation kit would be better for this case).

File 1 (to downsample) : Read 1 = 909314392 bp / Read 2 = 908769646 bp
File 2 (targeted # of bases) : Read 1 = 754861054 bp / Read 2 = 755899824 bp

I haven't seen as much topics about downsampling by number of bases than about downsampling by number of reads. I have found this topic about using BBmap's reformat.sh to do it : Base count NOT read count - Downsampling to specific number of bases not reads?

In fact, I am kind of stuck because this is paired-end, I am not sure how to do it properly. I used reformat.sh with the sbt option and using read 1 and read 2 Fastq files (File 1) as input files but I can only give one number of bases to the sbt option, so I have choosen the File 2 Read 1 length = 754861054 bp :

reformat.sh  in=File1_R1.fastq  in2=File1_R2.fastq  out=File1_R1_subsampled.fastq  out2=File1_R1_subsampled.fastq  sbt=754861054

It resulted in two output files of the half of 754861054 bp : File1_R1_subsampled = 377540241 bp / File1_R2_subsampled = 377320812 bp. I supposed that I had to give the double of 754861054 bp (1509722108) to the sbt option in order to obtain 2 files of approximately 754861054 bp length. So this is what I've tried next :

reformat.sh  in=File1_R1.fastq  in2=File1_R2.fastq  out=File1_R1_subsampled.fastq  out2=File1_R1_subsampled.fastq  sbt=1509722108

Which gave me in the end : File1_R1_subsampled = 755092384 bp / File1_R2_subsampled = 754629644 bp. Which is better in term of number of bases but I am not sure if this is accurate. Since I've used this command with 2 input files, reformat.sh must have understood that it were paired FASTQ files.

I have also tried to it by downsampling separately the R1 file then the R2 file but, even if it returned the right number of bases in the end, I guess that they won't be paired anymore and it won't be useful for the next step which is the assembly if it is the case...

Does anyone know if the method I have used (second code with option sbt=1509722108) is a correct way to downsample paired-end FASTQ files with a precise number of bases or is it totally non-sense ?
I was told that I could do it manually but I am not sure how, is there any script permitting to remove X last bases from a FASTQ file (if this is how we do it manually, I am not sure about that neither) or to downsample by giving an exact number of bases ?

Thank you in advance, any help would really be appreciated.

ngs FASTQ downsampling bbmap • 3.7k views

ADD COMMENT • link 4.5 years ago by mikizu • 0

0

Entering edit mode

downsample this FASTQ File 1 to the same number of bases than the FASTQ File 2

Were you actually asked to do this or is that your interpretation?

Since you were asking BBMap to target 1509722108 bases that is what you got with File1_R1_subsampled = 755092384 bp + File1_R2_subsampled = 754629644 bp..

ADD REPLY • link 4.5 years ago by GenoMax 147k

0

Entering edit mode

The request was to get the same length in bases for both files indeed. Thank you for the confirmation.

ADD REPLY • link 4.5 years ago by mikizu • 0

0

Entering edit mode

It's the request that's nonsense. If the two downsampled files have within a few percent of the same number of bases, that's fine for comparing them to each other.

ADD REPLY • link 4.5 years ago by swbarnes2 14k

0

Entering edit mode

Indeed, I'll just compare them like that. It gave me an average coverage depth after the assembly of 128 instead of 155 for the downsampled files, which is comparable to the other files with an average coverage depth of the resulting assembly of 126.

ADD REPLY • link 4.5 years ago by mikizu • 0

0

Entering edit mode

For the context, these two samples are from the same yeast strain but are from 2 different sequencing runs using 2 different library preparation kits. These data will be used to do a draft de novo assembly in order to compare which file, at equal coverage, generates the best assembly (and then which library preparation kit would be better for this case).

A better way to do these comparisons would probably be to normalize the data using bbnorm.sh for the runs independently and then compare the results of assembly. A guide is available here.