Subsample FASTA files to a certain size
2.7 years ago
Sam ▴ 20

Hi there,

Can anyone suggest a tool or method to extract a random ~10 GB subset of reads, with a minimum read length of 1,000 bp, from a huge 100 GB file?

I have 50 different fa.gz files of varying sizes (20-100 GB), and I would like to subsample each of them down to about 10 GB of FASTA.

Thanks

Best
sam

fasta sequence

Sampling a 100 GB file takes a long time and requires substantial computational resources. You can use seqkit: filter on length with seqkit seq (the -m/-M options set minimum/maximum read length), then draw the random sample with seqkit sample; its two-pass mode (-2) and the -j threads option help keep memory and runtime manageable on large files.
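As a rough illustration of the two-pass idea without installing anything, here is a plain-awk sketch. The file name and sizes are small made-up stand-ins; for the real data you would set MINLEN=1000 and TARGET to roughly 10000000000 bases.

```shell
# --- demo input: a 500 bp read and a 1500 bp read (illustrative only) ---
{ printf '>short\n'; printf 'A%.0s' $(seq 500); printf '\n'
  printf '>long\n';  printf 'C%.0s' $(seq 1500); printf '\n'; } > reads.fa

MINLEN=1000   # drop reads shorter than this
TARGET=2000   # target number of bases to keep (~10e9 for the real data)

# Pass 1: count total bases in reads that pass the length filter.
TOTAL=$(awk -v m="$MINLEN" '
  /^>/ { if (len >= m) tot += len; len = 0; next }
       { len += length($0) }
  END  { if (len >= m) tot += len; print tot + 0 }' reads.fa)

# Pass 2: keep each qualifying read with probability TARGET/TOTAL,
# so the expected output size is ~TARGET bases.
awk -v m="$MINLEN" -v tot="$TOTAL" -v tgt="$TARGET" '
  function flush() {
    if (hdr != "" && length(seq) >= m && rand() < p) print hdr "\n" seq
  }
  BEGIN { srand(); p = (tot > 0 && tgt < tot) ? tgt / tot : 1 }
  /^>/  { flush(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
  END   { flush() }' reads.fa > sampled.fa

echo "kept $(grep -c '^>' sampled.fa) of $(grep -c '^>' reads.fa) reads"
```

With the demo values, TARGET exceeds the filtered total, so the sampling probability is 1 and the output is exactly the length-filtered set; on real data the probability is TARGET/TOTAL and the output size is only approximate, which is usually fine for subsampling.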

2.7 years ago
Mensur Dlakic ★ 28k

reformat.sh from the BBTools package can do that.

reformat.sh in=in.fq out=out.fq samplereadstarget=5000000

Instead of specifying the exact number of reads (5 million above), you can use a fraction (samplerate=0.2).
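For this thread's specific goal (about 10 GB per file with a 1,000 bp minimum), reformat.sh can also sample to a base-count target rather than a read-count target. A sketch, assuming the BBTools flags minlength and samplebasestarget (confirm against reformat.sh's own help output); the file names are placeholders:

```shell
reformat.sh in=reads.fa.gz out=sub10g.fa.gz minlength=1000 samplebasestarget=10000000000
```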

I am not sure that random subsampling is a good idea, especially if you have a metagenome where some MAGs may be at low abundance. A longer, but I think also better, way to do this is digital normalization, as implemented in khmer.

PS Never mind my khmer suggestion - I just realized that you have fasta files.


Thanks a lot, Mensur Dlakic. I also found this tool very helpful: https://github.com/rrwick/Filtlong
