I have a FASTQ file with 2,198,402 reads. I am trying to use FastqSampler (from ShortRead) to select 2 million reads at random, but strangely I don't get 2 million back -- I get fewer: 1,929,951 in the example below. Why?
Might it have something to do with the way FastqSampler chunks the input file? (The author describes the chunking here: A: Selecting random pairs from fastq? )
It works fine if I set n to 1 million reads, so the shortfall only appears when n is close to the total number of reads in the file.
> library(ShortRead)
> fq=readFastq("file.fq")
> fq
class: ShortReadQ
length: 2198402 reads; width: 100 cycles
>
> fqs=FastqSampler("file.fq", n=2e6)
> yield(fqs)
class: ShortReadQ
length: 1929951 reads; width: 100 cycles
>
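For comparison, my expectation is based on plain reservoir sampling, which always returns exactly n records whenever the stream holds at least n. Here is a base-R sketch of that idea (my own illustration, not ShortRead's actual implementation, which I understand reads the file in chunks):

```r
## Reservoir sampling (Algorithm R), base R only: keep the first n items,
## then replace a kept item with decreasing probability as the stream grows.
reservoir_sample <- function(stream, n) {
  res <- stream[seq_len(min(n, length(stream)))]
  if (length(stream) > n) {
    for (i in (n + 1):length(stream)) {
      j <- sample.int(i, 1)          # position i survives with prob n/i
      if (j <= n) res[j] <- stream[i]
    }
  }
  res
}

set.seed(1)
s <- reservoir_sample(seq_len(2198402), 2e6)
length(s)  # exactly 2000000, never fewer
```

So with 2,198,402 reads available and n = 2e6, I would expect exactly 2,000,000 reads from yield(fqs), which is why the 1,929,951 result surprises me.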
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ShortRead_1.14.4 Rsamtools_1.8.6 lattice_0.20-13 Biostrings_2.24.1 GenomicRanges_1.8.13 IRanges_1.14.4
loaded via a namespace (and not attached):
[1] Biobase_2.16.0 grid_2.15.2 hwriter_1.3 tools_2.15.2