Extracting Random Sequences From A Large Fasta File Using Bioperl

0

Entering edit mode

11.2 years ago

joshualevipayne ▴ 70

Hello.

I have a large fasta file, approximately 4.5GB (hg19, est.fa). I would like to extract a number of sequences from this file at random and without replacement.

I've followed some wonderful advice in an earlier thread (Extracting multiple fasta sequences at a time from a file containing MANY sequences) to build a database using BioPerl. The code is below.

 use Bio::DB::Fasta;

 my $fastaFile = $ARGV[0];
 my $numsamples = $ARGV[1];;

 #Build database of fasta file, if not already built.
 print "building database\n";
 my $db = Bio::DB::Fasta->new( $fastaFile );

 #Get the sequence ids
 print "extracting sequence ids\n";
 my @ids = $db->get_all_ids;

 for($i = 0; $i < $numsamples; $i++) {

      #Get a random sequence ID from @ids
      print "extracting random sequence " . $i . "\n";
      $seqID = $ids[rand @ids];    

      my $sequence = $db->seq($seqID);
      if  (!defined( $sequence )) {
        die "Sequence $seqID not found. \n" 
       }   
      print ">$seqID\n", "$sequence\n";
 }

This works very well for small to moderately sized files (e.g., hg19, upstream1000.fa), but for larger files the code hangs on the "my @ids = $db->get_all_ids;" line.

My question: Is it possible to modify this line to simply grab, say, 100 of these IDs at random? If not, is there another way to solve this problem?

fasta bioperl • 8.3k views

ADD COMMENT • link updated 11.1 years ago by Alex Reynolds 36k • written 11.2 years ago by joshualevipayne ▴ 70

0

Entering edit mode

duplicate of

choosing random set of seqs from larger set

How to randomly select 2 sequences from Fasta format in PERL?

ADD REPLY • link 11.2 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Hi Pierre,

Sorry if this thread appears redundant, but I don't think that it actually is; neither of the linked threads discuss Bio::DB::Fasta.

In these linked threads, you provide a solution that requires shuff. This isn't a favorable solution here due to the large file size. Chris Penkett's solution is closer to my own, but his will also get stuck because it requires putting all the sequence names into an array.

So, the more pointed question is whether it is possible to extract random ids from $db.

Thanks, Josh

ADD REPLY • link 11.2 years ago by joshualevipayne ▴ 70

1

Entering edit mode

11.1 years ago

Alex Reynolds 36k

EDIT: I replaced all of this Perl code with my sample tool, which implements the sampling approach I describe below using fairly portable C (as well as sampling with replacement and some other features).

If your FASTA file is single-line (header on one line, record on the next), then you can use sample with -l 2 to sample line pairs:

$ sample -k ${SAMPLES} -l 2 records.fa > sample.fa

(As a bonus, sampling pairs of lines uses half the memory.)

If you have multi-line FASTA, it can be easily converted to single-line FASTA.

Please see the following site for more information: https://github.com/alexpreynolds/sample and sample --help.

This is not an answer specific to BioPerl, but you could do this with most languages, including Perl.

You don't need to read the entire file into a database stored in memory, but if you can read through the file multiple times, then you can do this with just enough memory to store some metadata about the file, which will be much less than 4.5 GB. The metadata can be used to generate your random sample.

Stream through the file once, collecting the start byte offset of the sequence header and the end byte offset of the last residue in the sequence. Repeat for each sequence. Put these offset data into a list or array. This list will be large but still much smaller than 4.5 GB.

Randomly shuffle the indices of the list with a standard permutation library available to your language of choice, and then pull the first N elements of this list that you want to sample - this is equivalent to sampling without replacement.

Each sample is a random list of byte offsets for a sequence header and its associated sequence.

Note that the list indices are unsigned integers that can be used to order the sequences in the random sample, assuming that your input FASTA file was sorted to begin with. Sort the elements in the sample by the list index values. This will give you byte offsets in order, therefore giving you sorted FASTA output, and it will greatly minimize your I/O time in the next step.

Now open the file a second time and use standard file seek and read operations to read out each sequence element. Your output will be the random sample, in sort order.

Repeat this shuffle-and-pull operation for as many samples as you need to draw.

To demonstrate with some Perl, here is an example FASTA file:

	>seq0
	FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
	>seq1
	KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
	LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
	>seq2
	EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
	>seq3
	MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
	>seq4
	EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
	>seq5
	SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
	>seq6
	FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
	>seq7
	SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
	>seq8
	SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
	>seq9
	KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
	>seq10
	FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

view raw example.fa hosted with ❤ by GitHub

Here is a script that samples five random sequences from this example FASTA file:

	#!/usr/bin/env perl

	use strict;
	use warnings;
	use List::Util qw(shuffle);
	use Fcntl qw(SEEK_CUR);

	my $inFn = "example.fa";
	my $n = 5;

	my ($inFh, $seqs, $seqIdx, $start, $end, $seq);

	#
	# discover byte bounds of FASTA sequences
	#

	$seqIdx = 0;
	open $inFh, "<", $inFn or die "could not open FASTA input\n$?";
	while (<$inFh>) {
	if ($_ =~ /^>/) {
	$start = (tell $inFh) - length($_);
	if ($start == 0) {
	$seqs->[$seqIdx]->{startByte} = $start;
	}
	else {
	$end = $start - 1;
	$seqs->[$seqIdx]->{endByte} = $end;
	$seqIdx++;
	$seqs->[$seqIdx]->{startByte} = $start;
	}
	}
	}
	$end = tell $inFh;
	$seqs->[$seqIdx]->{endByte} = $end;
	close $inFh;

	#
	# permute indices
	# sample first N elements from permuted indices
	# sort indices
	#

	my @idxs = map { $_ } (0 .. scalar(@{$seqs}) - 1);
	my @shuffledIdxs = shuffle @idxs;
	undef @idxs;
	my @sampleIdxs = @shuffledIdxs[0..($n - 1)];
	my @sortedSampleIdxs = sort { $a <=> $b } @sampleIdxs;

	#
	# draw the sample of sequence byte boundaries
	#

	my @sampleSeqs = map { $seqs->[$_] } @sortedSampleIdxs;

	#
	# read out sampled sequences from input FASTA file
	# as byte offsets are ordered, we can use SEEK_CUR to minimize I/O from seeking through file
	#

	my $pStart = 0;
	my $pLength = 0;
	open $inFh, "<", $inFn or die "could not open FASTA input\n$?";
	foreach (@sampleSeqs) {
	seek $inFh, $_->{startByte} - $pStart - $pLength, SEEK_CUR;
	$pStart = $_->{startByte};
	$pLength = $_->{endByte} - $_->{startByte};
	read $inFh, $seq, $pLength;
	print STDOUT "$seq\n";
	}
	close $inFh;

view raw randomSampleFastaWithoutReplacement.pl hosted with ❤ by GitHub

A few sample runs might look like this:

$ ./randomSampleFastaWithoutReplacement.pl
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

$ ./randomSampleFastaWithoutReplacement.pl
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR

$ ./randomSampleFastaWithoutReplacement.pl
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK

One caveat: As it works with byte offsets and not character counts, I think this may not work as written if sequence headers contain Unicode (multibyte) characters. The way to fix this would be to modify the length function call to return the number of bytes in the sequence header, as opposed to the number of characters.

If your FASTA file is made up solely of ASCII or single-byte characters, then this should run correctly - and scale up to roughly as many start-end byte offset pairs and indices as you can store in memory (three 64-bit integers or 24 bytes per sequence, plus data structure overhead).

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 11.1 years ago by Alex Reynolds 36k

0

Entering edit mode

11.1 years ago

Prakki Rama ★ 2.7k

I am not an expert programmer. But I tried the following to generate the random fasta sequences from a file. Took help of script Fasta1line.pl

#`perl fasta1line.pl upstream5000.fa OL_upstream5000.fa`; #Converted Sequences into Online Fasta
@random_set;
$var=`grep '>' OL_upstream5000.fa | cut -d ">" -f 2`;
@SeqNames=split("\n",$var);
%Hash=();

$Seq_needed=8; ###Number of random sequences required

$SeqIndex=$#SeqNames+1;
for (0..$Seq_needed) {
     $candidate = int rand($SeqIndex);
     redo if $seen{$candidate}++;
    $SeqNames =~ s/^\s+//; #remove leading spaces
    $SeqNames =~ s/\s+$//; #remove trailing spaces

    $Hash{$SeqNames[$candidate]}="0";
    push @random_set, $candidate;

    if(($#random_set+1)==$Seq_needed)
    {
    #print "$#random_set+1 is equal to $Seq_needed, Terminating LOOP\n";
    last;
    }
  }
#print "random numbers used: @random_set\n";

#print "##extract fasta sequence\n";
open FH,"OL_upstream5000.fa";
$cnt=0;
open OUT,">out.txt";
while(<FH>)
{
    if($cnt==$Seq_needed)
    {last;}
    $header=$_;
    $Seq=<FH>;
    $header =~ s/\>//g;
    $header =~ s/^\s+//; #remove leading spaces
    $header =~ s/\s+$//; #remove trailing spaces
    #print "$header\n";
    if(exists($Hash{$header}))
    {
    print OUT"\>$header\n";
    print OUT"$Seq";
    $cnt++;
    }
}
close(FH);
close(OUT);

ADD COMMENT • link 11.1 years ago by Prakki Rama ★ 2.7k

Login before adding your answer.