One-hot encoding for the human genome
8.4 years ago

Hello all, I am working on a project that requires a one-hot encoded human genome as input data. I came up with a Python script; it works, but it is not memory-efficient: for a 3 GB FASTA file it uses almost 20 GB of memory. Can you help me improve the memory efficiency and, if possible, the time efficiency of this code?

import numpy as np
from Bio import SeqIO
import time
import h5py

def vectorizeSequence(seq):
    # The order of the letters is not arbitrary:
    # flip the matrix up-down and left-right for the reverse complement.
    ltrdict = {'a': [1, 0, 0, 0], 'c': [0, 1, 0, 0], 'g': [0, 0, 1, 0],
               't': [0, 0, 0, 1], 'n': [0, 0, 0, 0]}
    return np.array([ltrdict[x] for x in seq])


starttime = time.time()
fasta_sequences = SeqIO.parse("hg38_new.fa", 'fasta')
with h5py.File('genomeEncoded.h5', 'w') as hf:
    for fasta in fasta_sequences:
        # Get the record name and sequence (str(seq) replaces the
        # deprecated Seq.tostring()).
        name, sequence = fasta.id, str(fasta.seq)
        # One-hot encode; the dataset name records the chromosome.
        data = vectorizeSequence(sequence.lower())
        print(name + " is one hot encoded!")
        # Write to HDF5.
        hf.create_dataset(name, data=data)
        print(name + " is written to dataset")


endtime = time.time()
print("Encoding is done in " + str(endtime - starttime) + " seconds")

Thanks in advance...


What is a one-hot encoded genome, for the benefit of non-programmers? Is it this?


Yes, in fact I am also curious to know what one-hot encoding is. Is it jargon, or some memory-efficiency method?


For problems that are categorical, such as representing a nucleotide in a set {A, C, G, T}, we can encode each nucleotide in a separate column of a matrix where column 1 corresponds to A being present, column 2 to C being present, and so on. This kind of representation is called one-hot encoding, because only one bit is set to true. There are other ways to encode categorical data, but this is a common one for creating a numeric matrix that can be fed into a machine learning algorithm.
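For instance, here is a minimal sketch of that matrix in numpy (the column order A, C, G, T is just a convention):

import numpy as np

# Map each base to its column; rows are positions in the sequence.
mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
seq = "ACGT"
onehot = np.zeros((len(seq), 4), dtype=np.uint8)
for i, base in enumerate(seq):
    onehot[i, mapping[base]] = 1
print(onehot)
# [[1 0 0 0]
#  [0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]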

8.4 years ago
Steven Lakin ★ 1.8k

Does your downstream machine learning application require this format? It is common to encode the standard DNA bases bitwise:

A = 0b00
C = 0b01
G = 0b10
T = 0b11

This will compress your encoding to two bits per nucleotide. Right now you're multiplying the size by 4x and incurring the overhead of a numpy array, though that is fairly well optimized behind the scenes. However, if you do 2-bit encoding, you need to know how to compute distances on encoded DNA strings, which usually means you'd have to implement the machine learning algorithm yourself.
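As a rough illustration, a minimal sketch of the packing (the bit order within each byte, first base in the high bits, is my assumption, and ambiguous bases like N are not handled):

CODE = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}

def pack_2bit(seq):
    # Pack 4 bases into each output byte, first base in the high bits.
    out = bytearray()
    byte = 0
    for i, base in enumerate(seq.upper()):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:              # a full byte every 4 bases
            out.append(byte)
            byte = 0
    if len(seq) % 4:                # left-pad the final partial byte
        byte <<= 2 * (4 - len(seq) % 4)
        out.append(byte)
    return bytes(out)

print(pack_2bit("ACGTAC").hex())    # '1b10'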

If your downstream application requires vectorized one-hot encoding as you've shown us here, I don't think you can do better than your numpy code, though you may gain performance by using tuples in your dictionary instead of lists. Perhaps you could look into SciPy's implementation of sparse matrices for better memory management, but I'm not sure what the results will be.

ltrdict = {'a': (1, 0, 0, 0), 'c': (0, 1, 0, 0), 'g': (0, 0, 1, 0),
           't': (0, 0, 0, 1), 'n': (0, 0, 0, 0)}
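If you want to try the sparse route, a hedged sketch with scipy (the CSR format and uint8 dtype are my assumptions; whether it actually saves memory depends on the size of scipy's internal index arrays):

import numpy as np
from scipy import sparse

mapping = {'a': 0, 'c': 1, 'g': 2, 't': 3}

def vectorize_sparse(seq):
    # Record only the coordinates of the 1s; 'n' rows stay all-zero.
    rows = [i for i, b in enumerate(seq) if b in mapping]
    cols = [mapping[seq[i]] for i in rows]
    data = np.ones(len(rows), dtype=np.uint8)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(seq), 4))

m = vectorize_sparse("acgtn")
print(m.toarray())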

As an aside, it has been my experience that machine learning with large genomic data in Python incurs a very high overhead cost and is slow due to the nature of the language. C-based languages are likely best for these kinds of applications.


I actually tried bitwise encoding, but it did not work as I expected. Maybe I didn't manage to do it correctly, I don't know.

8.4 years ago

It looks like you want to convert to a 4-bit encoding. I don't really understand your code; it looks like it's translating bases to arrays of numbers, which would be really inefficient. Normally you would translate A -> 1, C -> 2, G -> 4, T -> 8.

You can simply read 1 base at a time, convert, and write 1 base at a time. There's no reason to use any memory beyond the minimum for efficient buffering (a few KB which will probably be handled automatically).

P.S. If the output needs to be packed at 2 bases per byte, you would need to process 2 bases at a time, and know the endian-ness required... there's actually not enough information given to answer the question, in any case.
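To make the streaming idea concrete, here is a minimal sketch using one output byte per base (the output file name is a placeholder, and bases other than A/C/G/T pass through untranslated here):

# Translate A/C/G/T (either case) to the byte values 1/2/4/8.
CODES = bytes.maketrans(b"ACGTacgt", bytes([1, 2, 4, 8] * 2))

with open("hg38_new.fa", "rb") as src, open("encoded.bin", "wb") as dst:
    for line in src:
        if line.startswith(b">"):           # skip FASTA headers
            continue
        dst.write(line.rstrip(b"\n").translate(CODES))

This reads line by line rather than literally one base at a time, but the memory footprint is still just the line buffer.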


The reason I want one-hot encoded input is that the deep learning model needs to treat every nucleotide with the same weight. If I give them 1, 2, 4, 8, there will be a bias between the weights and the output will be compromised.


Oh - I guess I don't really understand what kind of format the tool needs. 1, 2, 4, and 8 are all the same weight in binary; each has three 0s and a single 1.
