How to create 12.8 billion unique alphanumeric strings of length ranging from 2-6?
5
0
Entering edit mode
8.3 years ago
anu014 ▴ 190

Hello Biostars! I want to create barcodes for each of 3.2 billions genomic locations with each of its 4 types of alleles. Hence, the total barcodes will be 3.2 * 4 billion (12.8 billion).

For this I have written a program like this:

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use String::Random qw/random_regex/;

open my $fh, ">", "test_codes_new.txt";

my %names;
my $a = random_regex('[a-zA-Z][0-9]{1,6}');
$names{$a} = 1;
print $fh "$a\n";
my $b =$names{$a};

for (my $i = 0; $i < 12800000000; $i++){
until (! exists $names{$b}) {$b = random_regex('[a-zA-Z]{1,3}[0-9]{1,3}');}

if (defined $b) {
$names{$b} = 1;
print $fh "$b\n";
         }


                    }

But this is crashing my system invoking 'Apport' program & telling me 'core-dump' plus some of very few values were repeated also. Is there another way to do this?

Thank you in advance!

R perl • 6.5k views
ADD COMMENT
4
Entering edit mode

Hi anupriyaverma1408, I'm not sure what you are doing or trying to achieve, but this doesn't sound like bioinformatics to me. Can you explain the link to bioinformatics? Without link I will close this thread.

Cheers, Wouter

ADD REPLY
0
Entering edit mode

Hello WouterDeCoster , 12.8 billion strings are nothing but 3.2 billions genomic locations * 4 types of alleles. I want to create barcodes for these positions.

ADD REPLY
0
Entering edit mode

Please add this information to your original post and try to be as informative as possible when asking questions.

Does your script run out of memory?

ADD REPLY
0
Entering edit mode

Yes. Because I'm using hash I guess. But I didn't use 100% memory as it was showing 27.7% memory usage in 'top'.

ADD REPLY
0
Entering edit mode

Sorry for the inconvenience.. I'll keep my next posts more informative. Thanks :)

ADD REPLY
4
Entering edit mode
  1. What do you want to use the barcode for? If you need a unique identifier, you can simply enumerate all numbers from 1 to 12 billion.
  2. If you want to use the barcode in genomic sequences somehow, you can only use ACGT.
  3. There are ~57 billion (62^6) combinations. Therefore, to select 12 billion, is a significant portion and will lead to many collisions while you fill up your unique id hash by drawing random strings, and therefore will need many more than 12 billion iterations. Instead, you should iterate over all possible combinations from aaaaaa-ZZZZZZ.
ADD REPLY
0
Entering edit mode

1) Have you searched in google if there is a perl or R library that allows to calculate all the permutations of a set of characters? 2) I am pretty sure that there are better solutions than generating 12.8 billion strings manually, are you sure there are no alternative ways to solve the problem?

ADD REPLY
0
Entering edit mode

I've tried String::Random in perl, uniqid function in php but both are generating redundant strings after some iterations.

ADD REPLY
3
Entering edit mode

Can't you just concatenate the chromosome ID, the position and the allele to get unique identifiers? e.g. chr1:465846131 G could be 01465846131G. Or perhaps I misunderstood.

ADD REPLY
1
Entering edit mode

To add to @WouterDeCoster's comment: Is there a reason the barcodes need to be generated randomly? Why not come up with a 6 character alphanumeric barcoding scheme and assign them that way?

ADD REPLY
1
Entering edit mode

Oh, hahah, great minds think alike! :) ...your great mind was just 9 hours faster :P

ADD REPLY
1
Entering edit mode

But what's the value of just 9 hours in a PhD program...

ADD REPLY
0
Entering edit mode

$25.20

ADD REPLY
3
Entering edit mode
8.3 years ago
John 13k

I'm clearly missing something. You want a unique identifier for every base in the genome, plus every possible allele for that base.

Why not just use the format "chromosome_name:position:base"? You can generate that in 12 seconds.

ADD COMMENT
1
Entering edit mode

Yes that's correct but I want to create just alphanumeric codes of length 6 due to some security reasons. I don't want to use conventional biology terms in the codes..

ADD REPLY
3
Entering edit mode

Ah, ok, well that's a sensible reason not to use what i proposed - however - if you really mean security then you really mean security, and security isn't easy. The famous quote from the author of PGP I think is going to be useful here:

When I was in college in the early 70s, I devised what I believed was a brilliant encryption scheme. A simple pseudorandom number stream was added to the plaintext stream to create ciphertext. This would seemingly thwart any frequency analysis of the ciphertext, and would be uncrackable even to the most resourceful government intelligence agencies. I felt so smug about my achievement.

Years later, I discovered this same scheme in several introductory cryptography texts and tutorial papers. How nice. Other cryptographers had thought of the same scheme. Unfortunately, the scheme was presented as a simple homework assignment on how to use elementary cryptanalytic techniques to trivially crack it. So much for my brilliant scheme. From this humbling experience I learned how easy it is to fall into a false sense of security when devising an encryption algorithm.

Even the pros fail at security with regularity. Rolling your own is probably a really really bad idea. So here's my two bits of advice:

  1. Your 6-character-hash constraint massively reduces the security of the system you are trying to build, because there is simply no well-established hashing algorithm to do this, and anything anyone tells you off the top of their head is going to be easily crackable. If you're trying to protect customer/patient genotyping information, this is most likely also legally unacceptable as it won't conform to any of the security standards (NIST/ISO-x/etc). The ONLY way to do this securely, is to:
    1. Generate 12.8 billion (or whatever) 6-character-strings, from aaaaaa upwards.
    2. Randomly (very randomly) shuffle this list.
    3. Assign each 6-character-string to a chr:pos:base string, from chr1:0:A upwards.
    4. Put the conversion table up on a very well secured server with an API for establishing an encrypted connection,taking a username/password, then accepting your strings and returning the correct chr:pos:base string. Note that this is only as secure as your username/password to access the database.
    5. A server is for life, not just for christmas - you'll need to maintain it all year round with updates, pay the bills, give it treats when it's a good boy, etc.
  2. Alternatively, encrypt the chr:pos:base format with AES-128. For example, "chr14:11423432:A" encrypted with the key of "yolo" would give the encrypted base64 string "bWBJpBiQoEZxzcCsPcDQMg==". This violates the 6-character restriction, but it's how i'd do it.
ADD REPLY
0
Entering edit mode

Thank you John for an explained answer. It clear my thinking regarding encryption. I will use such techniques for encryption in future :)

ADD REPLY
2
Entering edit mode

Generating globally unique ids is a solved problem and has nothing to do with bioinformatics. It takes as little as doing

import uuid
print (uuid.uuid4())

will print:

27ba19d7-3a22-4bd0-9830-e9969e21b40c
ADD REPLY
0
Entering edit mode

I think he wants to go from the uuid back to the position though

ADD REPLY
2
Entering edit mode

In addition to what John and Istvan wrote, I cannot imagine a valid security reason for such a short hash. Generating a cryptographically safe uid as Istvan recommends is preferred. You need to explain better why you want to do this, because I am guessing that this is an XY problem, that means:

create 12.8 billion unique alphanumeric strings of length ranging from 2-6

is probably not even a valid solution to the underlying problem. However, as long as you are reluctant to tell us what the real problem is, we will not be able to help you.

ADD REPLY
1
Entering edit mode
8.3 years ago

Here is my Python script. Run it like this:

python random_barcodes.py 100 (in your case this number would be 12.8 billion)

Direct the output (in linux) by using ">".

import random,sys

number_of_picks=[1,2,3,4,5,6]

try:

    number=int(sys.argv[1])

except IndexError:

    print "Error: Number of barcodes not defined\n"

    print "Usage: python random_barcodes.py <number_of_barcodes>"

    sys.exit()    

count=0

old_barcode=''

while count<number :

alphabet = ''.join(random.choice('0123456789') for i in range(random.choice(number_of_picks)))

numeric=  ''.join(random.choice('ABCDEFGHIJKLMNOPQRSTUVWXYZ') for i in range(random.choice(number_of_picks)))

barcode = alphabet + numeric

if len(barcode) <=6 and barcode != old_barcode:

    print barcode   

    old_barcode= barcode

    count+=1

PS: For some reason, I am unable to make the script text appear as code in Biostar (as I always do). Is there some problem with the site? Can somebody check?

Let me know if this works for you.

ADD COMMENT
3
Entering edit mode

I modified your post to format the code, I just selected the entire text block and clicked the 101010 button. Alternatively you can use a github gist to share code.

Additional suggestion: try to write python3 compatible code by treating print like a function and not as a statement: Use print(barcode) instead of print barcode. I'm also not sure what the purpose of $python is in your script... I wouldn't assume everyone has a $python environmental variable.

Your code only checks if the new barcode isn't equal to the previous but it won't check whether the new barcode has been used earlier, e.g. 10000 iterations ago.

ADD REPLY
3
Entering edit mode

You can easily generate this with python (Python 3):

from itertools import product
patterns = product("ATGC", repeat=17)
for candidate in patterns:
    print ("".join(candidate))

will produce:

AAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAT
AAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAATA
AAAAAAAAAAAAAAATT
AAAAAAAAAAAAAAATG
AAAAAAAAAAAAAAATC
AAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAGT
AAAAAAAAAAAAAAAGG
AAAAAAAAAAAAAAAGC
AAAAAAAAAAAAAAACA
AAAAAAAAAAAAAAACT
AAAAAAAAAAAAAAACG
AAAAAAAAAAAAAAACC
AAAAAAAAAAAAAATAA
AAAAAAAAAAAAAATAT
AAAAAAAAAAAAAATAG
AAAAAAAAAAAAAATAC
AAAAAAAAAAAAAATTA
AAAAAAAAAAAAAATTT
AAAAAAAAAAAAAATTG
AAAAAAAAAAAAAATTC
AAAAAAAAAAAAAATGA
AAAAAAAAAAAAAATGT
AAAAAAAAAAAAAATGG
....

for a total of 17,179,869,184 lines.

ADD REPLY
0
Entering edit mode

Thank you Istvan. This solution worked best for me! It is quite fast as compared to perl scripts that I wrote. I just changed the below command to create alphanumeric codes:

patterns = product("QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm1234567890", repeat=6)
ADD REPLY
0
Entering edit mode
8.3 years ago

Hi WouterDeCoster,

You are correct.

  1. Sorry for the "$" sign. I was just meant to show the start of command prompt in linux. I just removed it.

  2. Thanks for suggestion! I am gradually transitioning to Python3. Also, thank for pinpointing (I am only checking the new barcode with previous one). May be creating a set out of a list having all the barcodes would solve the problem.

Implementing ...

Also, very strange but the "ADD COMMENT" "ADD REPLY" buttons are not working for me. You suggestion appears as : "I modified your post to format the code, I just selected the entire text block and clicked the 101010 button." (notice 101010 instead of a button name. It will be the code button, right?) I cannot see the buttons to insert image, adding code etc. which I used to see. Is it a problem with my browser ? Any guesses?

ADD COMMENT
0
Entering edit mode

Sounds like a browser problem, have you tried another to see if you have the same problem? Have you tried turning off browser extensions and/or clear the cache?

ADD REPLY
0
Entering edit mode
8.3 years ago
ddiez ★ 2.0k

This answer in stackoverflow seems to deal with this problem in R. Basically the package stringi has a function called stri_rand_strings() that generates random alphanumeric strings. Only difference compared to what you want is that the length is fixed. Also, although generating 10^6 strings was very quick, when I tried 10^9 it was still running after some time at the time of this post... (my system: 2.8 GHz Intel Core i5, 8 GB).

ADD COMMENT
0
Entering edit mode
8.3 years ago
Michael 55k

As you wanted an example in Perl: http://www.perlmonks.org/?node_id=991276

The above example shows how to generate all DNA k-mers of $size, I like this example snippet best:

sub permute_loop
{
    my ($size) = @_;
    my  @a     = qw(A T G C);
    while (--$size)
    {
        @a = map { $_ . 'A', $_ . 'T', $_ . 'G', $_ . 'C' } @a;
    }
    return @a;
}
ADD COMMENT

Login before adding your answer.

Traffic: 3897 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6