Hi all,
FASTQ files contain sequencing reads 'as they come off the sequencing instrument.' Is there any particular order to the reads in long-read FASTQ files from ONT and PacBio? E.g., based on position on the flow cell, or on quality?
I am trying to extract a certain number of reads from both ONT and PacBio data using seqtk sample, something like below.
./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq
I want to make sure that the above example, pcb_sub.fastq, gives 10,000 reads drawn from across all the reads in the pcb.fastq file.
Thanks in advance.
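One quick way to check the size of a subsample is to count its records. The sketch below assumes strict 4-line FASTQ records (no wrapped sequence lines), which is what seqtk normally writes; the `count_reads` helper and `demo.fastq` are hypothetical names used only for illustration.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Tiny made-up demo file so the sketch runs without real data.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > demo.fastq

n=$(count_reads demo.fastq)
echo "$n"
```

On real data, `count_reads pcb_sub.fastq` should report the number of reads requested from seqtk, provided the input file holds at least that many reads.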
Hello, thanks for helping. Can you clarify what you mean by "Cat them together and then sample again from that pool to ensure that you get a random mix"?
I thought extracting pcb_sub.fastq with the above command gives a random mix of 10,000 reads.
Thanks!
reformat.sh will give you a random mix if you use the sampling parameters. If you were worried about there being some pattern in either seqtk or the above command, then you can do multiple sampling rounds to get a new set to sample from, if you had a gigantic dataset to begin with. I was perhaps being too conservative.
Oh I see, so for seqtk, I can run the above command with a different seed number, such as
./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq
in the first round and,
./seqtk sample -s101 pcb_sub.fastq 10000 > pcb_sub2.fastq
in the second round, right?
If you want to be super cautious, then yes.
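To illustrate why the order of reads in the input should not matter for a seeded random subsample, here is a minimal sketch of reservoir sampling in awk. It is not seqtk's actual code; whether seqtk uses exactly this scheme internally is an assumption. The `sample_reads` helper and `demo_pool.fastq` are hypothetical, and strict 4-line FASTQ records are assumed.

```shell
# Reservoir-sample K reads from a 4-line FASTQ with a fixed seed.
# Illustration only, not seqtk's implementation.
sample_reads() {  # usage: sample_reads SEED K FILE
    awk -v seed="$1" -v k="$2" '
        BEGIN { srand(seed) }
        {
            rec = rec $0 "\n"               # accumulate the 4 lines of one record
            if (NR % 4) next                # record not complete yet
            n++
            if (n <= k) {
                keep[n] = rec               # fill the reservoir first
            } else {
                j = int(rand() * n) + 1     # uniform index in 1..n
                if (j <= k) keep[j] = rec   # replace a slot with probability k/n
            }
            rec = ""
        }
        END {
            m = (n < k) ? n : k
            for (i = 1; i <= m; i++) printf "%s", keep[i]
        }
    ' "$3"
}

# Made-up pool of 20 reads so the sketch runs without real data.
for i in $(seq 1 20); do printf '@r%s\nACGT\n+\nIIII\n' "$i"; done > demo_pool.fastq

a=$(sample_reads 100 5 demo_pool.fastq)
b=$(sample_reads 100 5 demo_pool.fastq)
```

Every read ends up in the output with the same probability regardless of where it sits in the file, and repeating the run with the same seed gives the same reads. That is why a single sampling pass is normally enough and a second round is only extra caution.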