I'm trying to build two classifiers for DNA sequences, one using LinearSVC and another using a convolutional neural network. I one-hot encoded the DNA sequences and stored the resulting array for each sequence in a list. However, after the train-test split step I was not able to use that list to fit the model.
My DNA sequences look like this (this is not my real dataset, which is much bigger; it is just an example. All the sequences are in the file 'seqs_for_test.fasta'):
>TE_seq1
CCATAAACTATCTAAATAAGCACTTTTCTGGCTCTCTGGCCCCCCTTCTTCTTTTTGGGAAGGTGACAG
AGGGTAAAAGGGCTCTCTGCCGTGCGAGGCTCCTCACAGACACACAGCAAGAAAGAAGCGCCGCGCAGCA
>TE_seq2
GATAGCCCCTCTCCCAGCCCCAGTCTGATCCCTAACCCTAACTCCACGGCTCCTGTCTCTACCCCCGTCT
CTTTCTTCTTGTACCCTAGTCCCCCAGATCATTAGCTCCCTGCTCGGGCCCAGGGTTTTAAGAGAAGCCC
>TE_seq3
TGACTCAAGTCATGCTACCCAGCCCCGTCTTCTTAAAAATGAGACATGTTGAGACACCCTGCTTTTCGCC
TACAAACACATCCATTCTCTATACTTAGTCTTATTTAAATTCTATCCTCTGTATGTCTAGTCCTGGGGGT
>RD_seq4
TGCTCGCCCCCCAGGAAGTGCAGAGACCGCCTGGGTGTGACTGTTTTTAGGCCTAACAAAGGCACAGAAA
CACCCGTGCGGTCTCTGTATCCCCTGGAGGTATTTCTCCCCATTAGTTTGCTTGACACTAAGTTTTTAAA
>RD_seq5
TAAAAAAAGCTTATTAAGTCCCTAGAACCTGGGACCTATCTACCCAAGTTTTAAAACCTTACTTTTAAGG
CTACATTTTTTTATTTTGACTGTTTTACCATAAGGTCACATATAGGAAACCCCCACTGTCCTAATAAAAA
>RD_seq6
CTAATCTCCTGTTGGCTGACTTACATCAGTTTGGGAAGTTGTTCATGATGACTCTGCGACGATCAAGAAG
GACCAGGACTCTCCCTGGACACCTCAGGGACTTCTTGCTGGAGGGCACCATACATCAGTTTGCCAGCAAA
Here is my code for LinearSVC:
import pandas as pd
import numpy as np
from numpy import array
from numpy import argmax
from Bio import SeqIO
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
with open('../fasta/seqs_for_test.fasta') as fasta_file:  # Will close handle cleanly
    identifiers = []
    sequences = []
    for seq_record in SeqIO.parse(fasta_file, 'fasta'):  # (generator)
        identifiers.append(seq_record.id)
        sequences.append(seq_record.seq.lower())
s1 = pd.Series(identifiers, name='ID')
s2 = pd.Series(sequences, name='sequence')
# Gathering Series into a pandas DataFrame and rename index as ID column
fasta_frame = pd.DataFrame(dict(ID=s1, sequence=s2)).set_index(['ID'])
fasta_frame
label_serie = pd.Series()
fasta_frame.insert(1, "label", label_serie)
# Transposable element (TE) == 0; Random (RD) == 1.
fasta_frame.loc[fasta_frame.index.str.contains(r'TE_'),'label'] = 0
fasta_frame.loc[fasta_frame.index.str.contains(r'RD_'),'label'] = 1
fasta_frame
# empty list to store ohe array sequences
res_arr = []
for index, row in fasta_frame['sequence'].items():
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(row)
    # print(integer_encoded)
    # binary encode
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
    # print(index)
    # print(onehot_encoded)
    # append ohe arrays
    res_arr.append(onehot_encoded)
y = fasta_frame['label']
# y
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(res_arr,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=42)
# print(x_train)
# print(y_train)
# print(x_test)
# print(y_test)
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
modelo = LinearSVC()
modelo.fit(x_train, y_train)
previsoes = modelo.predict(x_test)
acuracia = accuracy_score(y_test, previsoes) * 100
print("accuracy was %.2f%%" % acuracia)
I've tried reshape, np.vstack and other approaches, but without success. How can I use the list of arrays as my training set?
Error message:
ValueError: Found array with dim 3. Estimator expected <= 2.
You should indicate which line of code raised the error, and also print the shape and head of res_arr. It looks like the shape of your array is not correct.
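One possible way to report that information is sketched below. The `res_arr` built here is a hypothetical stand-in (six random one-hot arrays of shape (140, 4)), not the asker's data; only the inspection calls at the end are the point.

```python
import numpy as np

# Hypothetical stand-in for res_arr: six one-hot arrays of shape (140, 4).
# Indexing the 4x4 identity matrix with random integers yields one-hot rows.
rng = np.random.default_rng(0)
res_arr = [np.eye(4, dtype=int)[rng.integers(0, 4, size=140)] for _ in range(6)]

# Shape of each element in the list.
shapes = [arr.shape for arr in res_arr]
print("number of sequences:", len(res_arr))
print("distinct per-sequence shapes:", set(shapes))

# What an estimator sees once the list is converted to a single array.
stacked = np.asarray(res_arr)
print("stacked shape:", stacked.shape, "ndim:", stacked.ndim)

# "Head" of the first encoded sequence: its first five rows.
print(res_arr[0][:5])
```

If every element has the same 2-D shape, converting the list gives a 3-D array (sequences, positions, bases), which is the kind of input the error message complains about.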
Hi @shoujun.gu, the line where I got the error is:
res_arr is a list of NumPy arrays. In this example set (not my real dataset), res_arr is composed of six arrays, all with shape (140, 4). The arrays look like:
Each row represents one A, C, G or T in the sequence: A=[1,0,0,0], C=[0,1,0,0], G=[0,0,1,0], T=[0,0,0,1].
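The mapping described above can be illustrated with a small sketch. This is not the code from the question: `one_hot_encode`, `ALPHABET` and `BASE_TO_INDEX` are hypothetical names, and a fixed A, C, G, T column order is assumed so that every sequence gets the same four columns even if one base is absent from it. Treating any other symbol (such as N) as an all-zero row is also an assumption.

```python
import numpy as np

# Fixed column order matching the mapping in the text: A, C, G, T.
ALPHABET = "ACGT"
BASE_TO_INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot_encode(sequence):
    """Return a (len(sequence), 4) integer array with one row per base."""
    encoded = np.zeros((len(sequence), len(ALPHABET)), dtype=int)
    for position, base in enumerate(str(sequence).upper()):
        column = BASE_TO_INDEX.get(base)
        # Assumption: symbols outside A/C/G/T are left as an all-zero row.
        if column is not None:
            encoded[position, column] = 1
    return encoded


example = one_hot_encode("ACGT")
print(example)
print(one_hot_encode("ccataaac").shape)
```

Encoding "ACGT" this way gives the 4x4 identity matrix, one row per base in the order listed above.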
You need to check the shape of your real training data, since the error message shows it is a 3D array, not 2D.
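A minimal sketch of one way to get a 2-D input, assuming all sequences have the same length as in the example set (sequences of different lengths would need padding or truncation first). The `res_arr` here is again a hypothetical stand-in built from random data. The idea is to stack the list into one 3-D array and then flatten each sequence into a single row, so there is one row per sequence; `np.vstack` is shown only for contrast, because it produces one row per base instead.

```python
import numpy as np

# Hypothetical stand-in for res_arr: six one-hot arrays of shape (140, 4).
rng = np.random.default_rng(42)
res_arr = [np.eye(4, dtype=int)[rng.integers(0, 4, size=140)] for _ in range(6)]

# Stack into a single 3-D array: (n_sequences, seq_length, n_bases).
X = np.stack(res_arr)
print(X.shape)

# Flatten each sequence into one feature vector: (n_sequences, seq_length * n_bases).
X_flat = X.reshape(X.shape[0], -1)
print(X_flat.shape)

# For contrast: np.vstack concatenates along the position axis, giving one row
# per base rather than one row per sequence, so it no longer lines up with
# the six labels.
X_vstack = np.vstack(res_arr)
print(X_vstack.shape)
```

With one row per sequence, `X_flat` could be passed to `train_test_split` and `LinearSVC.fit` in place of `res_arr`. A convolutional network would typically take the 3-D array `X` instead, but that depends on the framework and layer used.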