I am looking for a simple script or tool that will randomize the order of sequences in a very large FASTA file (8 GB). Note that I do not want to touch any sequence data within a given sequence, only the order of entire sequences.
After randomizing I would also like to use a function like "head" that takes the top X number of sequences from that randomized file.
I will be comparing the full randomized file and random subsets of that file, so while a single script that accomplishes both tasks seems practical, two separate scripts would be more useful and probably easier to write.
Thanks for the help
Something like this?
Here's the link to faSomeRecords.
Alternatively, to build the headers file:
grep "^>" file.fasta | sort -R > headers.txt
shuf is probably faster though.
Neither shuf nor sort -R works on OS X, though. Well, not without installing GNU coreutils anyway.
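To shuffle whole records instead of just the headers, one option is to put each record on a single line, shuffle the lines, and split them back. This is a minimal sketch, assuming GNU shuf is installed and that no header contains a tab; shuffle_fasta is a made-up name. Note that shuf holds its input in memory, which matters for an 8 GB file, and that wrapped sequences come out on one line each (the residues themselves are untouched).

```shell
# Shuffle whole FASTA records; sequence data is never reordered or altered.
# Assumptions: GNU shuf is on the PATH, and no header line contains a tab.
shuffle_fasta() {
    # Step 1: join each record into one line: header<TAB>sequence
    awk '/^>/ { if (seen) printf "\n"; seen = 1; printf "%s\t", $0; next }
         { printf "%s", $0 }
         END { if (seen) printf "\n" }' "$1" |
    # Step 2: shuffle the one-line records
    shuf |
    # Step 3: split each record back into a header line and a sequence line
    awk -F'\t' '{ print $1; print $2 }'
}
```

Because every record is then exactly two lines, head gives the top X sequences: run shuffle_fasta file.fasta > shuffled.fasta, then head -n 2000 shuffled.fasta > top1000.fasta for the first 1000.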
sort -R works from the characters in the headers (it sorts by a random hash of each line, so identical lines stay together), so I thought shuf would be more effective at randomizing.
The comment by 5heikki about it not working refers to the command not being available on a Mac, rather than one being more effective. In place of shuf, you can do this on a Mac:
I'm not going to comment further because I think this question has been thoroughly answered already, especially in the links I provided in my other comment.
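For a shuf-free shuffle on a Mac, one common approach (not necessarily the one meant above) is to tag each line with a random number, sort on that tag, and strip it again. A rough sketch using only awk, sort and cut; shuffle_lines is a hypothetical helper name.

```shell
# Shuffle lines from stdin without shuf or sort -R.
# Caveat: awk's srand() seeds from the clock, so two runs started within
# the same second may produce the same order.
shuffle_lines() {
    # Prefix every line with a random key and a tab
    awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' |
    # Sort numerically on the random key (C locale keeps "." as the decimal point)
    LC_ALL=C sort -k1,1n |
    # Drop the key, keeping the original line intact
    cut -f2-
}
```

Something like grep "^>" file.fasta | shuffle_lines > headers.txt would then stand in for the sort -R step.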
I like this, thanks. Simple and easily modified to fit the specific situation.
If you search the forum you will find a number of helpful posts (look in the right-hand column for references). E.g., Resampling Fastq Sequences Without Replacement and Choosing Random Set Of Seqs From Larger Set. Just FYI, adding to existing discussions is preferred over creating duplicate posts because that makes it easier to find answers.