Hi !
Firstly, I am only a Bioinformatics student, I am sorry if what I am saying isn't clear. I was asked to downsample one FASTQ file (File 1) to the same number of reads than another sample's FASTQ file (File 2) (already done), and after that to downsample this FASTQ File 1 to the same number of bases than the FASTQ File 2. These files are from Illumina paired-end DNA sequencing runs on a NextSeq 550.
For the context, these two samples are from the same yeast strain but are from 2 different sequencing runs using 2 different library preparation kits. These data will be used to do a draft de novo assembly in order to compare which file, at equal coverage, generates the best assembly (and then which library preparation kit would be better for this case).
- File 1 (to downsample) : Read 1 = 909314392 bp / Read 2 = 908769646 bp
- File 2 (targeted # of bases) : Read 1 = 754861054 bp / Read 2 = 755899824 bp
I haven't seen as much topics about downsampling by number of bases than about downsampling by number of reads. I have found this topic about using BBmap's reformat.sh
to do it : Base count NOT read count - Downsampling to specific number of bases not reads?
In fact, I am kind of stuck because this is paired-end, I am not sure how to do it properly. I used reformat.sh with the sbt option and using read 1 and read 2 Fastq files (File 1) as input files but I can only give one number of bases to the sbt option, so I have choosen the File 2 Read 1 length = 754861054 bp :
reformat.sh in=File1_R1.fastq in2=File1_R2.fastq out=File1_R1_subsampled.fastq out2=File1_R1_subsampled.fastq sbt=754861054
It resulted in two output files of the half of 754861054 bp : File1_R1_subsampled = 377540241 bp / File1_R2_subsampled = 377320812 bp. I supposed that I had to give the double of 754861054 bp (1509722108) to the sbt option in order to obtain 2 files of approximately 754861054 bp length. So this is what I've tried next :
reformat.sh in=File1_R1.fastq in2=File1_R2.fastq out=File1_R1_subsampled.fastq out2=File1_R1_subsampled.fastq sbt=1509722108
Which gave me in the end : File1_R1_subsampled = 755092384 bp / File1_R2_subsampled = 754629644 bp. Which is better in term of number of bases but I am not sure if this is accurate. Since I've used this command with 2 input files, reformat.sh
must have understood that it were paired FASTQ files.
I have also tried to it by downsampling separately the R1 file then the R2 file but, even if it returned the right number of bases in the end, I guess that they won't be paired anymore and it won't be useful for the next step which is the assembly if it is the case...
Does anyone know if the method I have used (second code with option
sbt=1509722108
) is a correct way to downsample paired-end FASTQ files with a precise number of bases or is it totally non-sense ?I was told that I could do it manually but I am not sure how, is there any script permitting to remove X last bases from a FASTQ file (if this is how we do it manually, I am not sure about that neither) or to downsample by giving an exact number of bases ?
Thank you in advance, any help would really be appreciated.
Were you actually asked to do this or is that your interpretation?
Since you were asking BBMap to target
1509722108
bases that is what you got withFile1_R1_subsampled = 755092384 bp + File1_R2_subsampled = 754629644 bp.
.The request was to get the same length in bases for both files indeed. Thank you for the confirmation.
It's the request that's nonsense. If the two downsampled files have within a few percent of the same number of bases, that's fine for comparing them to each other.
Indeed, I'll just compare them like that. It gave me an average coverage depth after the assembly of 128 instead of 155 for the downsampled files, which is comparable to the other files with an average coverage depth of the resulting assembly of 126.
A better way to do these comparisons would probably be to normalize the data using
bbnorm.sh
for the runs independently and then compare the results of assembly. A guide is available here.Thank you again, I'll try to apply it today. I didn't look at all to bbnorm.