Entering edit mode
11.2 years ago
mad.cichlids
▴
140
Hi, All
I was trying to use seqtk to randomly pickup 32,000,000 sequences from my sequence pool.
I first combined all of sequences from two samples with :
zcat sub592_1.headcropped.fastq.gz sub592_2.headcropped.fastq.gz > sub592.fastq.gz
then verified that the total number of reads in my file by :
`cat sub592.fastq.gz | echo $((`wc -l`/4))` #indicate I have 39,965,078
but when I am using seqtk to draw 32million reads from the pool, it did not do the job for me, instead, it just said “killed” at the end, could you help me point to a direction where could it go wrong?
seqtk sample ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 3.2e+7 > s592_32.fastq
Thank you very much for your help and have a great day !
Thank you so much, Istvan, is there anyway around it? or just find another machine to do the job. Best, B
there have been some posts on this topic but not all of the answers would work for selecting a large subset.
I think some of the linearization methods as described below should work but each requires a shuffle or a sort so might take a while probably:
Selecting random pairs from fastq?
An easier solution (in case you don't need exactly 32 million only something close to that number) is to go through the file and generate a random number for each record and print it out if larger than a threshold (at the level that you need)
something as simple as this would do:
it outputs:
see how it almost splits at 50%. If you need to split two files the same way (paired end files) just set the random seed to be the same for both files.