Hello all, I am working on a project that requires a one hot encoded human genome as a input data. I came up with a Python script, however it's not efficient in memory. It's working but requires a lot of memory, for 3gb fasta file it uses almost 20gb memory. Can you help me improve the memory efficiency and (if possible) time efficiency of this algorithm
import pandas as pd
import numpy as np
from Bio import SeqIO
import time
import h5py
def vectorizeSequence(seq):
# the order of the letters is not arbitrary.
# Flip the matrix up-down and left-right for reverse compliment
ltrdict = {'a':[1,0,0,0],'c':[0,1,0,0],'g':[0,0,1,0],'t':[0,0,0,1], 'n':[0,0,0,0]}
return np.array([ltrdict[x] for x in seq])
starttime = time.time()
fasta_sequences = SeqIO.parse(open("hg38_new.fa"),'fasta')
with h5py.File('genomeEncoded.h5', 'w') as hf:
for fasta in fasta_sequences:
# get the fasta files.
name, sequence = fasta.id, fasta.seq.tostring()
# Write the chromosome name
new_file.write(name)
# one hot encode...
data=vectorizeSequence(sequence.lower())
print name + " is one hot encoded!"
# write to hdf5
hf.create_dataset(name, data=data)
print name + " is written to dataset"
endtime = time.time()
print "Encoding is done in " + str(endtime)
Thanks in advance...
What is a
a hot encoded
genome for the benefit of non-programmers? Is it this?Yes in fact am also curious to know what is
hot encoded
? is it a jargon or some memory efficiency method?For problems that are categorical, such as representing a nucleotide in a set {A, C, G, T}, we can encode each nucleotide in a separate column of a matrix where column 1 corresponds to A being present, column 2 to C being present, and so on. This kind of representation is called one-hot encoding, because only one bit is set to true. There are other ways to encode categorical data, but this is a common one for creating a numeric matrix that can be fed into a machine learning algorithm.