Order of reads in long-read FASTQ files
2.1 years ago
shinyjj ▴ 50

Hi all,

FASTQ files contain sequencing reads 'as they come off the sequencing instrument.' Is there any particular order to them in a long-read FASTQ file from ONT or PacBio? For example, are they ordered by position on the flow cell, or by quality?

I am trying to extract a certain number of reads from both ONT and PacBio data using seqtk sample, something like the command below.

./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq

I want to make sure that the output of the above example, pcb_sub.fastq, contains 10,000 reads sampled from all of the reads in pcb.fastq.

Thanks in advance.

fastq seqtk rna-seq • 1.3k views
2.1 years ago
GenoMax 147k

You could sub-sample and generate a number of files. Cat them together and then sample again from that pool to ensure that you get a random mix.

You could also try reformat.sh from the BBMap suite, which gives you control over how you sample:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
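
To illustrate why the order of reads in the input need not matter when you ask for an exact number of output reads, here is a minimal Python sketch of reservoir sampling over FASTQ records. It is an illustration only, not the code of seqtk or reformat.sh. The function names `read_fastq` and `reservoir_sample` and the toy reads are made up for this example, and it assumes unwrapped 4-line FASTQ records.

```python
import random
from itertools import islice


def read_fastq(lines):
    """Yield one FASTQ record (a list of 4 lines) at a time.

    Assumes every record is exactly 4 lines (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def reservoir_sample(records, k, seed):
    """Return up to k records chosen uniformly at random in one pass."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Record i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


# Toy input: 1000 fake reads held in memory (hypothetical data).
toy = []
for n in range(1000):
    toy += [f"@read{n}\n", "ACGT\n", "+\n", "IIII\n"]

subset = reservoir_sample(read_fastq(toy), k=10, seed=100)
print(len(subset))  # prints 10
```

Every record has the same chance of ending up in the reservoir no matter where it sits in the file, so any ordering by flow cell position or quality in the input does not bias which reads are kept.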

Hello, thanks for helping. Can you clarify what you mean by "Cat them together and then sample again from that pool to ensure that you get a random mix"?

I thought that extracting pcb_sub.fastq with the above command gives a random mix of 10,000 reads.

Thanks!


reformat.sh will give you a random mix if you use the sampling parameters. If you are worried about some pattern remaining in the output of either seqtk or the command above, and you have a gigantic dataset to begin with, you can do multiple rounds of sampling to build a new pool to sample from. I was perhaps being too conservative.


Oh I see, so for seqtk, I can run the above command with a different seed in each round, such as ./seqtk sample -s100 pcb.fastq 10000 > pcb_sub.fastq in the first round and ./seqtk sample -s101 pcb_sub.fastq 10000 > pcb_sub2.fastq in the second round (writing to a new file so the input is not overwritten), right?


If you want to be super cautious, then yes.
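
Whichever tool you use, you can confirm that the output really holds the number of reads you asked for by counting its records. The following is a small Python sketch rather than part of seqtk or BBMap. The helper name `count_fastq_reads` is hypothetical, and it assumes plain-text, 4-line FASTQ records.

```python
import io


def count_fastq_reads(handle):
    """Count records in an open FASTQ file, assuming 4 lines per record."""
    # Ignore blank lines, such as a stray empty line at the end of the file.
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated or wrapped")
    return n_lines // 4


# Toy in-memory FASTQ with two reads (hypothetical data).
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print(count_fastq_reads(demo))  # prints 2
```

For the real file you would pass an open handle, for example `open("pcb_sub.fastq")`, and expect 10000 as long as pcb.fastq contains at least that many reads.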
