Hello Biostars! I want to create a barcode for each of the 3.2 billion genomic locations, one for each of its 4 possible alleles. Hence, the total number of barcodes will be 3.2 billion * 4 = 12.8 billion.
For this I have written a program like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use String::Random qw/random_regex/;

open my $fh, ">", "test_codes_new.txt";
my %names;
my $a = random_regex('[a-zA-Z][0-9]{1,6}');
$names{$a} = 1;
print $fh "$a\n";
my $b = $names{$a};
for (my $i = 0; $i < 12800000000; $i++) {
    until (! exists $names{$b}) {
        $b = random_regex('[a-zA-Z]{1,3}[0-9]{1,3}');
    }
    if (defined $b) {
        $names{$b} = 1;
        print $fh "$b\n";
    }
}
But this crashes my system: the 'Apport' program is invoked and reports a core dump. A few values were also repeated. Is there another way to do this?
Thank you in advance!
Hi anupriyaverma1408, I'm not sure what you are doing or trying to achieve, but this doesn't sound like bioinformatics to me. Can you explain the link to bioinformatics? Without a link I will close this thread.
Cheers, Wouter
Hello WouterDeCoster, the 12.8 billion strings are simply 3.2 billion genomic locations * 4 types of alleles. I want to create barcodes for these positions.
Please add this information to your original post and try to be as informative as possible when asking questions.
Does your script run out of memory?
Yes, because I'm using a hash, I guess. But it didn't use 100% of the memory; 'top' was showing 27.7% memory usage.
Sorry for the inconvenience. I'll make my next posts more informative. Thanks :)
1) Have you searched Google for a Perl or R library that can calculate all the permutations of a set of characters? 2) I am pretty sure there are better solutions than generating 12.8 billion strings manually. Are you sure there is no alternative way to solve the problem?
I've tried String::Random in Perl and the uniqid function in PHP, but both generate redundant strings after some iterations.
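A side note on the pattern used in the question's loop, [a-zA-Z]{1,3}[0-9]{1,3}: it can only produce a limited number of distinct strings, so the loop cannot reach 12.8 billion unique values no matter how long it runs. Below is a rough count, assuming String::Random treats {1,3} as "repeat one to three times". The helper name pattern_space is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the distinct strings that match <letters>{min,max}<digits>{min,max}.
# Assumption: {min,max} means "repeat between min and max times".
sub pattern_space {
    my ($letters, $digits, $min, $max) = @_;
    my ($letter_part, $digit_part) = (0, 0);
    for my $len ($min .. $max) {
        $letter_part += $letters ** $len;   # choices for the letter prefix
        $digit_part  += $digits  ** $len;   # choices for the digit suffix
    }
    return $letter_part * $digit_part;
}

my $space  = pattern_space(52, 10, 1, 3);   # 52 letters (a-z, A-Z), 10 digits
my $needed = 3_200_000_000 * 4;             # locations * alleles

print "distinct strings available: $space\n";
print "barcodes needed:            $needed\n";
print "the pattern is too small for this job\n" if $space < $needed;
```

Under that assumption the pattern yields roughly 159 million distinct strings, far fewer than the 12.8 billion required, so once they are used up the until loop would spin forever.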
Can't you just concatenate the chromosome ID, the position and the allele to get unique identifiers? e.g. chr1:465846131 G could be 01465846131G. Or perhaps I misunderstood.
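A minimal sketch of that concatenation idea in Perl. The function name make_id, the numeric codes for X, Y and M, and the 9-digit zero padding of the position are assumptions made for illustration; they are not part of the suggestion above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an identifier from chromosome, position and allele.
# Assumed numbering for non-numeric chromosomes (X=23, Y=24, M/MT=25).
my %chrom_code = (X => 23, Y => 24, M => 25, MT => 25);

sub make_id {
    my ($chrom, $pos, $allele) = @_;
    (my $name = $chrom) =~ s/^chr//i;                 # "chr1" -> "1"
    my $num = exists $chrom_code{uc $name} ? $chrom_code{uc $name} : $name;
    die "unknown chromosome: $chrom\n" unless $num =~ /^\d+$/;
    die "bad position: $pos\n"         unless $pos =~ /^\d+$/;
    die "bad allele: $allele\n"        unless $allele =~ /^[ACGT]$/;
    # 2-digit chromosome, 9-digit position (assumed to be wide enough), allele
    return sprintf("%02d%09d%s", $num, $pos, $allele);
}

print make_id('chr1', 465846131, 'G'), "\n";   # prints 01465846131G
```

Because the identifier is derived directly from the coordinates, no hash of previously issued values is needed and nothing has to be held in memory; the IDs can be streamed to a file one position at a time.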
To add to @WouterDeCoster's comment: Is there a reason the barcodes need to be generated randomly? Why not come up with a 6-character alphanumeric barcoding scheme and assign them that way?
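One way such a scheme could look, as a sketch only: number every (position, allele) pair sequentially and write that number in base 62 using digits plus upper- and lower-case letters. Six such characters give 62^6 (about 56.8 billion) codes, enough for 12.8 billion; a case-insensitive 36-symbol alphabet would give only 36^6 (about 2.2 billion), which is too few. The function names, the alphabet order and the allele ranking are assumptions, and a 64-bit Perl is assumed for the integer arithmetic.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed alphabet: 62 symbols, so barcodes are case-sensitive.
my @ALPHABET = ('0' .. '9', 'A' .. 'Z', 'a' .. 'z');

# Encode a non-negative integer as a fixed-width base-62 string.
sub index_to_barcode {
    my ($index, $width) = @_;
    my $base = scalar @ALPHABET;
    my $code = '';
    for (1 .. $width) {
        $code  = $ALPHABET[$index % $base] . $code;   # least significant digit first
        $index = int($index / $base);
    }
    die "index does not fit in $width characters\n" if $index > 0;
    return $code;
}

# Hypothetical numbering: four consecutive indices per genome-wide offset.
my %allele_rank = (A => 0, C => 1, G => 2, T => 3);

sub barcode_for {
    my ($offset, $allele) = @_;   # $offset: 0-based position across the whole genome
    die "bad allele: $allele\n" unless exists $allele_rank{$allele};
    return index_to_barcode($offset * 4 + $allele_rank{$allele}, 6);
}

print barcode_for(0, 'A'), "\n";               # first barcode
print barcode_for(3_199_999_999, 'T'), "\n";   # last barcode
```

Since each index maps to exactly one barcode, uniqueness is guaranteed by construction and the mapping can be reversed to recover the position and allele.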
Oh, hahah, great minds think alike! :) ...your great mind was just 9 hours faster :P
But what's the value of just 9 hours in a PhD program...
$25.20