I've converted the hits to 1's and 0's to represent whether or not they are enzyme/non-enzymes. For each query protein, I've ~5hits and therefore, I can use these 5hits as 5 training features.
> tmp
protein class
ProtA 1
ProtA 1
ProtA 1
ProtA 0
ProtA 1
ProtB 1
ProtB 1
ProtB 0
ProtB 0
ProtC 1
ProtD 1
ProtD 0
# install & load packages
library(reshape2)
library(plyr)
library(doMC)
registerDoMC(8) # assign 8 cores for ddply to use
# define function myfunc
myfunc <- function(x)
{
dim(x)
x$name = paste("Class",1:nrow(x))
dcast(x,protein~name,value.var = "class")
}
# call myfunc in ddply. run ddply in parallel mode to use >1 cores
res = ddply(.data = tmp,.variables = "protein",.fun = function(x) myfunc(x),.parallel=T)
# output
> res
protein Class 1 Class 2 Class 3 Class 4 Class 5
ProtA 1 1 1 0 1
ProtB 1 1 0 0 NA
ProtC 1 NA NA NA NA
ProtD 1 0 NA NA NA
tries to create a column with the header "name" containing 5 rows with values "Class1", "Class2"..."Class 5". The error occurs because you have more than 5 existing rows.
This code worked using the example that he has provided. I thought each protein has 5 hits. If each protein has 5 hits, then this code should work. I misunderstood the question because of the example.
As I checked through the blast results, there are few queries with more or less than 5 hits as well. I don't know how that happened, since I've used the max_target_seqs option while using BLAST.
Yes, I forgot to load the reshape2 package. I tried the script on the sample data which you've used and it seemed to work fine.
It is running on my data. SInce the data is pretty large, it's too memory intensive as well. I'll let you know of the output, when it completes executing.
Thanks a lot for your continuous help. Really appreciate that :)
I have updated my answer, you can use ddply in a parallel mode. I have added the parallel argument in the ddply function call. If you are running this on a cluster, you can give a reasonable number of cores. In the example, ddply uses 8 processors. You can change the number of cores in the registerDoMC() call.
It seem to do the work, but with some errors. It misses out few entries every now or then, and thus doesn't combine all the hits from the table to the matrix.
For example:
There are 5 entries for sp|Q9UKW4|VAV3_HUMAN in the table, but in the output, there is only 1.
I checked the awk script with some examples I created. I noticed that whenever the class value of the first entry (of the 5 entries) is zero, it skips that entry. So, I modified the script to treat the class values as strings and it seems to work. Can you check the modified script?
awk -F"\t" '{if(a[$1])a[$1]=a[$1]"\t"$2; else a[$1]=toupper($2);}END{for (i in a)print i, a[i];}' OFS="\t" YOUR_FILE
What is "Class"?
I've converted the hits to 1's and 0's to represent whether or not they are enzyme/non-enzymes. For each query protein, I've ~5hits and therefore, I can use these 5hits as 5 training features.