I am new to bioinformatics and to machine learning, and I am starting a project on protein classification using machine learning. I have two FASTA files, one for each of two classes of proteins. To do machine learning on them, I need to convert them into a .csv file of features, and I have no idea where to start. It would be a great help if anyone could show me how to load the AA indices from here: ftp://ftp.genome.jp/pub/db/community/aaindex/.
I am attaching a screenshot of my FASTA file here: https://ibb.co/CHNzvnH
Thanks in advance.
There are many ways to create protein features. Some of them are very fast but not necessarily very discriminative in the end. Since I don't know how comfortable you are with command-line tools versus web servers, here is a quick list of both:
Separately, I recommend SPBuild as a very good feature generator that also happens to be fast.
All of these are in the very fast category. To generate protein features that give the best classification, you will most likely need to derive them from protein alignments that capture sequence conservation. This usually requires a lot of sequence searching, and it isn't fast because protein databases contain on the order of hundreds of millions of sequences. In a nutshell: 1) run an iterative search with a given sequence using tools such as BLAST or HHpred; 2) build a multiple alignment of the query and all the matches, and extract the frequencies of all amino acids for each alignment column. In the end it will look something like this:
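As a minimal sketch of step 2 (not the output of any particular tool), the per-column frequencies can be computed in plain Python once you already have an alignment. The function name `column_frequencies` and the toy alignment are made up for illustration, and ignoring gaps and non-standard letters is an assumption, not a standard.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Return one dict of amino-acid frequencies per alignment column.

    `alignment` is a list of equal-length aligned sequences (strings).
    Gap characters and non-standard letters are ignored (an assumption).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        # Only the 20 standard residues contribute to the column total.
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment, invented for this example only.
toy_alignment = ["MKV-L", "MRVAL", "MKIAL"]
profile = column_frequencies(toy_alignment)
for position, freqs in enumerate(profile, start=1):
    observed = {aa: round(f, 2) for aa, f in freqs.items() if f > 0}
    print(position, observed)
```

Each column gives 20 numbers, so a protein of length N yields an N x 20 matrix. For a fixed-length feature vector per protein you would still need to summarise it, for example by averaging over columns.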
Maybe break the whole thing down into subtasks:
You have:
You want:
Tasks:
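For the conversion asked about in the question (two FASTA files in, one feature CSV out), a minimal Python sketch using only the standard library could look like this. All function names are hypothetical, and the sequences are toy data. The `TOY0000001` entry is typed by hand in the `aaindex1` flat-file layout (`H` accession line, `I` header line followed by two rows of ten values, `//` terminator) using the Kyte-Doolittle hydropathy values; check the layout against the real file you download before relying on the parser. Amino-acid composition plus the mean index value is just one simple choice of features.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Residue order of the two value rows under the "I" line (A/L R/K N/M ...).
ROW1, ROW2 = "ARNDCQEGHI", "LKMFPSTWYV"


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def parse_aaindex1(handle):
    """Return {accession: {residue: value or None}} from aaindex1-style text."""
    indices, accession, rows, in_values = {}, None, [], False
    for line in handle:
        if line.startswith("H "):
            accession = line[2:].strip()
        elif line.startswith("I "):
            in_values, rows = True, []
        elif line.startswith("//"):
            if accession and len(rows) == 2:
                indices[accession] = dict(zip(ROW1 + ROW2, rows[0] + rows[1]))
            accession, in_values = None, False
        elif in_values and line.strip():
            # Missing values are written as "NA" in the flat file.
            rows.append([None if v == "NA" else float(v) for v in line.split()])
    return indices


def features(seq, scale):
    """Amino-acid fractions plus the mean of one index over the sequence."""
    residues = [aa for aa in seq if aa in AMINO_ACIDS]
    n = len(residues)
    row = {f"frac_{aa}": (residues.count(aa) / n if n else 0.0) for aa in AMINO_ACIDS}
    known = [scale[aa] for aa in residues if scale.get(aa) is not None]
    row["mean_index"] = sum(known) / len(known) if known else 0.0
    return row


def fasta_to_rows(handle, label, scale):
    """Turn every record of one FASTA file into a labelled feature row."""
    for name, seq in read_fasta(handle):
        yield {"id": name, "label": label, "length": len(seq), **features(seq, scale)}


# Hand-typed entry in aaindex1 layout (Kyte-Doolittle hydropathy values).
TOY_ENTRY = """\
H TOY0000001
D Hydropathy values typed by hand for this sketch
I    A/L     R/K     N/M     D/F     C/P     Q/S     E/T     G/W     H/Y     I/V
     1.8    -4.5    -3.5    -3.5     2.5    -3.5    -3.5    -0.4    -3.2     4.5
     3.8    -3.9     1.9     2.8    -1.6    -0.8    -0.7    -0.9    -1.3     4.2
//
"""
scale = parse_aaindex1(io.StringIO(TOY_ENTRY))["TOY0000001"]

# Toy FASTA content standing in for the two real files.
class_a = io.StringIO(">p1 toy protein\nMKVL\nAA\n")
class_b = io.StringIO(">p2\nGGSW\n")
rows = list(fasta_to_rows(class_a, "class_a", scale))
rows += list(fasta_to_rows(class_b, "class_b", scale))

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = out.getvalue()
```

With real data, replace the `io.StringIO` objects with `open("your_file.fasta")` for each class and write to `open("features.csv", "w", newline="")` instead of `out`. The file names here are placeholders.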