Hi Alex,
Do the samples in the metadata need to be in the same order as in the kmer matrix header? Suppose I have a toy dataset of 5 species:
kmer matrix header:
kmers t1 t2 t3 t4 t5
Metadata:
t5 0
t1 1
t3 0
t2 1
t4 0
Will this cause any problems?
I thought so. But when I rearrange the order of the metadata (my metadata came in random order, so I reordered it), the results differ: the same ordering always gives the same results, but different orderings give different results. I tried 4 or 5 versions of the metadata that vary only in row order, and they all give different answers. I noticed this because I was running simulations, so I knew which kmers should be picked up. Kover did very well with a real dataset but failed in all of my simulations. Since the metadata in my real dataset was in order, I thought the failure might be related to the random order of the metadata in my simulations. However, even after I reordered the metadata, I still could not recover the kmers that I used to define the phenotype. Shall I send you my simulated dataset (several MB) to play with?
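To take row order out of the picture when comparing runs, the metadata can be sorted into the column order of the matrix header before the dataset is created. The following is a minimal Python sketch of that step and is not part of Kover: `reorder_metadata` is a hypothetical helper, and it assumes whitespace-separated fields with the kmer column first in the header, as in the 5-sample example above.

```python
def reorder_metadata(header_line, metadata_lines):
    """Return metadata rows sorted into the sample order of the matrix header.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    # The first header column labels the kmer column; the rest are sample IDs.
    samples = header_line.split()[1:]

    # Map each sample ID to its phenotype label, skipping blank lines.
    labels = {}
    for line in metadata_lines:
        if not line.strip():
            continue
        sample, label = line.split()
        labels[sample] = label

    # Fail loudly if the two files do not describe the same set of samples.
    missing = [s for s in samples if s not in labels]
    extra = sorted(set(labels) - set(samples))
    if missing or extra:
        raise ValueError(f"sample mismatch: missing={missing}, extra={extra}")

    # Emit tab-separated rows in header order.
    return [f"{s}\t{labels[s]}" for s in samples]


header = "kmers t1 t2 t3 t4 t5"
metadata = ["t5 0", "t1 1", "t3 0", "t2 1", "t4 0"]
reordered = reorder_metadata(header, metadata)
print("\n".join(reordered))
```

The check for missing or extra samples is there because a silent mismatch between the two files would be harder to spot than a difference in row order.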
It would definitely help me look into this if you could share your data. Would you be able to upload it to a server (e.g., https://mega.nz/) and share the link?
Also, can you include the kover commands that you are using to create and split the data? Did you set the random seed parameter in the "kover dataset split" command? If not, varying results are to be expected, since the examples in the training and testing set are different each time.
https://mega.nz/#F!n053zISY
No key needed.
kover commands used:
kover dataset create from-tsv --genomic-data kmerMatrix.tsv --phenotype-name "rpoBsimulation" --phenotype-metadata metadata.tsv --output temp.kover
kover dataset split --dataset temp.kover --id temp_split --train-size 0.666 --folds 5 --random-seed 72
kover learn --dataset temp.kover --split temp_split --model-type conjunction disjunction --p 0.1 1.0 10.0 --max-rules 5 --hp-choice cv --n-cpu 10
Yes, I did set the random seed, and I used the same one in all trials.
Please let me know if you have any problems accessing the data.
Thank you.