EDIT: I replaced all of this Perl code with my sample
tool, which implements the sampling approach I describe below using fairly portable C (as well as sampling with replacement and some other features).
If your FASTA file is single-line (header on one line, record on the next), then you can use sample
with -l 2
to sample line pairs:
$ sample -k ${SAMPLES} -l 2 records.fa > sample.fa
(As a bonus, sampling pairs of lines uses half the memory.)
If you have multi-line FASTA, it can be easily converted to single-line FASTA.
Please see the following site for more information: https://github.com/alexpreynolds/sample and sample --help
.
This is not an answer specific to BioPerl, but you could do this with most languages, including Perl.
You don't need to read the entire file into a database stored in memory, but if you can read through the file multiple times, then you can do this with just enough memory to store some metadata about the file, which will be much less than 4.5 GB. The metadata can be used to generate your random sample.
Stream through the file once, collecting the start byte offset of the sequence header and the end byte offset of the last residue in the sequence. Repeat for each sequence. Put these offset data into a list or array. This list will be large but still much smaller than 4.5 GB.
Randomly shuffle the indices of the list with a standard permutation library available to your language of choice, and then pull the first N elements of this list that you want to sample - this is equivalent to sampling without replacement.
Each sample is a random list of byte offsets for a sequence header and its associated sequence.
Note that the list indices are unsigned integers that can be used to order the sequences in the random sample, assuming that your input FASTA file was sorted to begin with. Sort the elements in the sample by the list index values. This will give you byte offsets in order, therefore giving you sorted FASTA output, and it will greatly minimize your I/O time in the next step.
Now open the file a second time and use standard file seek and read operations to read out each sequence element. Your output will be the random sample, in sort order.
Repeat this shuffle-and-pull operation for as many samples as you need to draw.
To demonstrate with some Perl, here is an example FASTA file:
Here is a script that samples five random sequences from this example FASTA file:
A few sample runs might look like this:
$ ./randomSampleFastaWithoutReplacement.pl
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
$ ./randomSampleFastaWithoutReplacement.pl
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
$ ./randomSampleFastaWithoutReplacement.pl
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
One caveat: As it works with byte offsets and not character counts, I think this may not work as written if sequence headers contain Unicode (multibyte) characters. The way to fix this would be to modify the length
function call to return the number of bytes in the sequence header, as opposed to the number of characters.
If your FASTA file is made up solely of ASCII or single-byte characters, then this should run correctly - and scale up to roughly as many start-end byte offset pairs and indices as you can store in memory (three 64-bit integers or 24 bytes per sequence, plus data structure overhead).
duplicate of
choosing random set of seqs from larger set
How to randomly select 2 sequences from Fasta format in PERL?
Hi Pierre,
Sorry if this thread appears redundant, but I don't think that it actually is; neither of the linked threads discuss Bio::DB::Fasta.
In these linked threads, you provide a solution that requires shuff. This isn't a favorable solution here due to the large file size. Chris Penkett's solution is closer to my own, but his will also get stuck because it requires putting all the sequence names into an array.
So, the more pointed question is whether it is possible to extract random ids from $db.
Thanks, Josh