Subsample FASTA files to a certain size
2.7 years ago
Sam ▴ 20

Hi there,

Can anyone suggest a tool or method to extract a random ~10 GB subset of reads, with a minimum read length of 1,000 bp, from a huge 100 GB file?

I have 50 different fa.gz files of varying sizes (20-100 GB), and I would like to subsample each of them down to about 10 GB of FASTA.

Thanks

Best
sam

fasta sequence

Sampling a 100 GB file takes a long time and requires substantial computational resources. You can use seqkit: filter on length with seqkit seq (the -m/-M options set minimum/maximum read length), then draw the random sample with seqkit sample; its two-pass mode (-2) and the -j threads option help keep memory and runtime manageable on large files.
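As a rough illustration of the two-pass idea without installing anything, here is a plain-awk sketch. The file name and sizes are small made-up stand-ins; for the real data you would set MINLEN=1000 and TARGET to roughly 10000000000 bases.

```shell
# --- demo input: a 500 bp read and a 1500 bp read (illustrative only) ---
{ printf '>short\n'; printf 'A%.0s' $(seq 500); printf '\n'
  printf '>long\n';  printf 'C%.0s' $(seq 1500); printf '\n'; } > reads.fa

MINLEN=1000   # drop reads shorter than this
TARGET=2000   # target number of bases to keep (~10e9 for the real data)

# Pass 1: count total bases in reads that pass the length filter.
TOTAL=$(awk -v m="$MINLEN" '
  /^>/ { if (len >= m) tot += len; len = 0; next }
       { len += length($0) }
  END  { if (len >= m) tot += len; print tot + 0 }' reads.fa)

# Pass 2: keep each qualifying read with probability TARGET/TOTAL,
# so the expected output size is ~TARGET bases.
awk -v m="$MINLEN" -v tot="$TOTAL" -v tgt="$TARGET" '
  function flush() {
    if (hdr != "" && length(seq) >= m && rand() < p) print hdr "\n" seq
  }
  BEGIN { srand(); p = (tot > 0 && tgt < tot) ? tgt / tot : 1 }
  /^>/  { flush(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
  END   { flush() }' reads.fa > sampled.fa

echo "kept $(grep -c '^>' sampled.fa) of $(grep -c '^>' reads.fa) reads"
```

With the demo values, TARGET exceeds the filtered total, so the sampling probability is 1 and the output is exactly the length-filtered set; on real data the probability is TARGET/TOTAL and the output size is only approximate, which is usually fine for subsampling.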

2.7 years ago
Mensur Dlakic ★ 28k

reformat.sh from the BBTools package can do that.

reformat.sh in=in.fq out=out.fq samplereadstarget=5000000

Instead of specifying the exact number of reads (5 million above), you can use a fraction (samplerate=0.2).
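For this thread's specific goal (about 10 GB per file with a 1,000 bp minimum), reformat.sh can also sample to a base-count target rather than a read-count target. A sketch, assuming the BBTools flags minlength and samplebasestarget (confirm against reformat.sh's own help output); the file names are placeholders:

```shell
reformat.sh in=reads.fa.gz out=sub10g.fa.gz minlength=1000 samplebasestarget=10000000000
```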

I am not sure that random subsampling is a good idea, especially if you have a metagenome where some MAGs may be at low abundance. A longer, but I think also better, way to do this is digital normalization, as implemented in khmer.

PS Never mind my khmer suggestion - I just realized that you have fasta files.


Thanks a lot, Mensur Dlakic. I also found this tool very helpful: https://github.com/rrwick/Filtlong
