I have a large fasta file, approximately 4.5GB (hg19, est.fa). I would like to extract a number of sequences from this file at random and without replacement.
use Bio::DB::Fasta;
my $fastaFile=$ARGV[0];
my $numsamples=$ARGV[1];;#Build database of fasta file, if not already built.
print "building database\n";
my $db= Bio::DB::Fasta->new($fastaFile);#Get the sequence ids
print "extracting sequence ids\n";
my @ids =$db->get_all_ids;
for($i= 0;$i<$numsamples;$i++){#Get a random sequence ID from @ids
print "extracting random sequence ".$i."\n";$seqID=$ids[rand @ids];
my $sequence=$db->seq($seqID);if(!defined($sequence)){
die "Sequence $seqID not found. \n"}
print ">$seqID\n", "$sequence\n";}
This works very well for small to moderately sized files (e.g., hg19, upstream1000.fa), but for larger files the code hangs on the "my @ids = $db->get_all_ids;" line.
My question: Is it possible to modify this line to simply grab, say, 100 of these IDs at random? If not, is there another way to solve this problem?
Sorry if this thread appears redundant, but I don't think that it actually is; neither of the linked threads discuss Bio::DB::Fasta.
In these linked threads, you provide a solution that requires shuff. This isn't a favorable solution here due to the large file size. Chris Penkett's solution is closer to my own, but his will also get stuck because it requires putting all the sequence names into an array.
So, the more pointed question is whether it is possible to extract random ids from $db.
EDIT: I replaced all of this Perl code with my sample tool, which implements the sampling approach I describe below using fairly portable C (as well as sampling with replacement and some other features).
If your FASTA file is single-line (header on one line, record on the next), then you can use sample with -l 2 to sample line pairs:
This is not an answer specific to BioPerl, but you could do this with most languages, including Perl.
You don't need to read the entire file into a database stored in memory, but if you can read through the file multiple times, then you can do this with just enough memory to store some metadata about the file, which will be much less than 4.5 GB. The metadata can be used to generate your random sample.
Stream through the file once, collecting the start byte offset of the sequence header and the end byte offset of the last residue in the sequence. Repeat for each sequence. Put these offset data into a list or array. This list will be large but still much smaller than 4.5 GB.
Randomly shuffle the indices of the list with a standard permutation library available to your language of choice, and then pull the first N elements of this list that you want to sample - this is equivalent to sampling without replacement.
Each sample is a random list of byte offsets for a sequence header and its associated sequence.
Note that the list indices are unsigned integers that can be used to order the sequences in the random sample, assuming that your input FASTA file was sorted to begin with. Sort the elements in the sample by the list index values. This will give you byte offsets in order, therefore giving you sorted FASTA output, and it will greatly minimize your I/O time in the next step.
Now open the file a second time and use standard file seek and read operations to read out each sequence element. Your output will be the random sample, in sort order.
Repeat this shuffle-and-pull operation for as many samples as you need to draw.
To demonstrate with some Perl, here is an example FASTA file:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Here is a script that samples five random sequences from this example FASTA file:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
One caveat: As it works with byte offsets and not character counts, I think this may not work as written if sequence headers contain Unicode (multibyte) characters. The way to fix this would be to modify the length function call to return the number of bytes in the sequence header, as opposed to the number of characters.
If your FASTA file is made up solely of ASCII or single-byte characters, then this should run correctly - and scale up to roughly as many start-end byte offset pairs and indices as you can store in memory (three 64-bit integers or 24 bytes per sequence, plus data structure overhead).
duplicate of
choosing random set of seqs from larger set
How to randomly select 2 sequences from Fasta format in PERL?
Hi Pierre,
Sorry if this thread appears redundant, but I don't think that it actually is; neither of the linked threads discuss Bio::DB::Fasta.
In these linked threads, you provide a solution that requires shuff. This isn't a favorable solution here due to the large file size. Chris Penkett's solution is closer to my own, but his will also get stuck because it requires putting all the sequence names into an array.
So, the more pointed question is whether it is possible to extract random ids from $db.
Thanks, Josh