Question

Order of reads in Long read FASTQ file

2.8 years ago

shinyjj ▴ 60

Hi all,

FASTQ files contain sequencing reads 'as they come off the sequencing instrument.' Is there any particular order to them in a long-read FASTQ file from ONT or PacBio, e.g. based on position on the flow cell, or on quality?

I am trying to extract a certain number of reads from both ONT and PacBio data using seqtk sample, something like below.

./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq

I want to make sure that the output of the above example, pcb_sub.fastq, contains 10,000 reads drawn from across all the reads in pcb.fastq.

Thanks in advance.

fastq seqtk rna-seq • 1.7k views

updated 2.8 years ago by GenoMax 152k • written 2.8 years ago by shinyjj ▴ 60

score 0 · Answer 1 · 2022-10-01

2.8 years ago

GenoMax 152k

You could sub-sample and generate a number of files. Cat them together and then sample again from that pool to ensure that you get a random mix.

You could also try reformat.sh from the BBMap suite, which gives you control over how you sample:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
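
As an illustration, the options above could be combined as in the sketch below. The in= and out= arguments are reformat.sh's usual input and output parameters; the input file name and the values 10000 and 100 are taken from the question, and the output name pcb_sub_bbmap.fastq is made up for this example. The guard simply skips the call when reformat.sh or the input file is not available.

```shell
# Values taken from the question; the output file name is hypothetical.
target_reads=10000
seed=100

# Run only if reformat.sh is on PATH and the input file exists.
if command -v reformat.sh >/dev/null 2>&1 && [ -f pcb.fastq ]; then
    # samplereadstarget asks for an exact number of output reads;
    # sampleseed makes the sampling reproducible.
    reformat.sh in=pcb.fastq out=pcb_sub_bbmap.fastq \
        samplereadstarget="$target_reads" sampleseed="$seed"
fi
```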

2.8 years ago by GenoMax 152k

Hello, thanks for helping. Can you clarify what you mean by "Cat them together and then sample again from that pool to ensure that you get a random mix"?

I thought extracting pcb_sub.fastq with the above command gives a random mix of 10,000 reads.

Thanks!
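
One way to check what a subsample actually contains is to count its reads and its distinct read IDs. The sketch below is only an illustration: it builds a three-read stand-in file (demo_sub.fastq, a made-up name) so that it runs anywhere, and it assumes strict four-line FASTQ records with no wrapped sequence or quality lines. In practice the same two awk commands would be pointed at pcb_sub.fastq.

```shell
# Tiny stand-in FASTQ with 3 reads; replace with pcb_sub.fastq in practice.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCA\n+\nIIII\n' > demo_sub.fastq

# Number of reads, assuming exactly 4 lines per record.
n_reads=$(awk 'END { print NR / 4 }' demo_sub.fastq)

# Number of distinct read IDs (first field of each header line).
n_unique=$(awk 'NR % 4 == 1 { print $1 }' demo_sub.fastq | sort -u | wc -l | tr -d ' ')

echo "reads: $n_reads, unique IDs: $n_unique"
```

If the two numbers match the requested count, the subsample has the expected size and no read was drawn twice.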

2.8 years ago by shinyjj ▴ 60

reformat.sh will give you a random mix if you use the sampling parameters. If you were worried about there being some pattern in the output of either seqtk or the above command then, if you had a gigantic dataset to begin with, you could do multiple sampling rounds to build a new pool to sample from. I was perhaps being too conservative.

2.8 years ago by GenoMax 152k

Oh I see. So for seqtk, I can run the above command with a different seed number in each round, such as ./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq in the first round and ./seqtk sample -s101 pcb_sub.fastq 10000 > pcb_sub.fastq in the second round, right?
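
One caution about the second-round command as written: it reads from pcb_sub.fastq and redirects its output to the same file. The shell truncates a redirect target before the program starts, so the input would already be empty when it is read. The sketch below shows the effect with plain files; head stands in for seqtk sample, and all file names are made up for the demonstration.

```shell
# Two-read stand-in file; head plays the role of "seqtk sample" here.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > demo_round1.fastq
cp demo_round1.fastq demo_same.fastq

# Same file as input and output: the shell empties demo_same.fastq
# before head gets a chance to read it.
head -n 4 demo_same.fastq > demo_same.fastq
same_file_lines=$(wc -l < demo_same.fastq | tr -d ' ')

# Distinct output file: the first record is written as expected.
head -n 4 demo_round1.fastq > demo_round2.fastq
separate_file_lines=$(wc -l < demo_round2.fastq | tr -d ' ')

echo "same file: $same_file_lines lines; separate file: $separate_file_lines lines"
```

Writing each round to its own output file avoids the problem.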