An alternative to using grep is to use a hash table in a scripting language like Perl or Python. (In Python, hash tables are called dictionaries.) The example script below uses Perl.
Hash tables use a lot of memory, but they are fast - often much faster than grep - because looking up a value in a hash table takes constant time on average, regardless of how many patterns it holds.
You say your BED data looks like this:
chr14 8835890 8835954 HWUSI-EAS476:3:2:4670:1030#0/1 42 -
chr14 8835750 8835806 HWUSI-EAS476:3:2:4670:1030#0/2 42 +
chr3 135330299 135330319 HWUSI-EAS476:3:2:6478:1025#0/1 42 +
chr3 135330468 135330508 HWUSI-EAS476:3:2:6478:1025#0/2 42 -
...
The patterns you are searching for look like everything up to the hash character (#) in the ID field of this BED file. For example, the ID HWUSI-EAS476:3:2:4670:1030#0/1 would yield the key HWUSI-EAS476:3:2:4670:1030. So if we parse the BED file line by line and turn its ID field into a key that matches your pattern of interest, we can do a quick lookup in the hash table.
Let's assume that your patterns are in a text file called patterns.txt and your BED elements are in a file called data.bed.
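Assuming each pattern is one read-ID prefix per line (IDs here are taken from the sample data above), patterns.txt might look like this:
HWUSI-EAS476:3:2:4670:1030
HWUSI-EAS476:3:2:6478:1025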
Here is an example Perl script called filter.pl that you could use to filter your BED file against your patterns:
#!/usr/bin/env perl
use strict;
use warnings;

my $patternFn = $ARGV[0];
if (! defined $patternFn) { die "Error: Specify pattern filename\n"; }

# build hash table of keys
print STDERR "Building hash table from patterns...\n";
my %keyring;
open my $keys, "<", $patternFn or die "Error: Could not open $patternFn: $!\n";
while (<$keys>) {
    chomp;
    $keyring{$_} = 1;
}
close $keys;

# look up key from each line of standard input
print STDERR "Filtering standard input against patterns...\n";
while (<STDIN>) {
    chomp;
    my ($chr, $start, $stop, $id, $score, $strand) = split("\t", $_);
    if (! defined $id) { die "Error: BED input is missing ID field\n"; }
    (my $key = $id) =~ s/#.*//; # copy id to key, then strip the hash character and everything after it
    if (exists $keyring{$key}) { print "$_\n"; }
}
You could use this like so:
$ chmod +x filter.pl
$ ./filter.pl patterns.txt < data.bed > data.filtered.bed
Again, if your list of patterns is too large to make a hash table that fits into system memory, you can run split on patterns.txt and then run filter.pl on each smaller patterns file, as in the sketch below. Then run cat on the filtered BED files to bring them all together into one result.
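A minimal sketch of that workflow, assuming bash and using pattern-chunk- as a hypothetical output prefix for split:
# split patterns into 100k-line chunks: pattern-chunk-aa, pattern-chunk-ab, ...
split -l 100000 patterns.txt pattern-chunk-
# filter the BED file against each chunk separately
for chunk in pattern-chunk-*; do
    ./filter.pl "$chunk" < data.bed > "$chunk.filtered.bed"
done
# union the per-chunk results into one file
cat pattern-chunk-*.filtered.bed > data.filtered.bed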
Thanks Alex, I tried your advice, but the process aborted because it ran out of memory. I assigned 20 GB of memory to this process, but it eventually aborted anyway. Does anyone know how to accomplish this without a lot of memory? Please help me!
1. Split up your file containing ~6M patterns into smaller subsets using split (see man split for more information). Perhaps start with 100k lines per subset, which will give you 60 smaller files, each containing a subset of patterns.
2. Run separate grep -f <filename> processes, where each <filename> is one of the pattern subsets.
3. If you have access to a computational cluster, you can run each of the separate grep calls on its own node. Otherwise, run them in series on your computer, one after the other.
4. When all the searches are done, union all the search results with cat, as in the sketch below.
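A minimal sketch of these steps, again assuming bash and a hypothetical pattern-chunk- prefix; the -F flag (fixed-string matching) is an addition of mine, since it is typically much faster than regex matching for plain read IDs:
# split patterns into 100k-line chunks
split -l 100000 patterns.txt pattern-chunk-
# run one grep per chunk, in series
for chunk in pattern-chunk-*; do
    grep -F -f "$chunk" data.bed > "$chunk.hits.bed"
done
# union the per-chunk results into one file
cat pattern-chunk-*.hits.bed > data.filtered.bed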