Question

Kover - order of metadata

0

7.8 years ago

hfan22 ▴ 40

Hi Alex,

Should the samples in the metadata be listed in the same order as in the k-mer matrix header? Suppose I have a dummy dataset of 5 species:

kmer matrix header:

kmers t1 t2 t3 t4 t5

Metadata:

t5 0
t1 1
t3 0
t2 1
t4 0

Will this cause any problem?

kover • 2.0k views

ADD COMMENT • link updated 7.8 years ago by alexandre.drouin1 • 0 • written 7.8 years ago by hfan22 ▴ 40

score 0 · Answer 1 · 2017-01-24

0

7.8 years ago

alexandre.drouin1 • 0

No, the order of the metadata is not important. The only thing that matters is that the identifiers are the same. Kover will automatically match the data between the k-mer matrix and the metadata based on the identifiers.

If you are interested, this is handled in https://github.com/aldro61/kover/blob/master/core/kover/dataset/create.py (lines 161 to 170).
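To illustrate the idea, here is a minimal sketch of identifier-based matching using the toy data from the question. It is not the code from create.py; the helper name `align_labels` and the in-memory parsing are assumptions made only for this example.

```python
# Hypothetical illustration only; this is NOT Kover's actual implementation.

def align_labels(header_ids, metadata):
    """Return the labels in the order of header_ids, looked up by identifier."""
    missing = [sample_id for sample_id in header_ids if sample_id not in metadata]
    if missing:
        raise ValueError("No metadata for: " + ", ".join(missing))
    return [metadata[sample_id] for sample_id in header_ids]


# Toy data from the question.
header = "kmers t1 t2 t3 t4 t5".split()[1:]  # drop the "kmers" column name
metadata_lines = ["t5 0", "t1 1", "t3 0", "t2 1", "t4 0"]

metadata = {}
for line in metadata_lines:
    sample_id, label = line.split()
    metadata[sample_id] = int(label)

labels = align_labels(header, metadata)
print(labels)  # [1, 1, 0, 0, 0]
```

Because each label is looked up by its identifier, listing the metadata rows in any other order yields the same aligned labels.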

Edit:

After looking at hfan22's data, we concluded that this was normal behaviour and that the order of the metadata is not important.

For computational reasons, Kover reorders the learning examples to group them by class (e.g.: 0 0 0 0 1 1 1 1). When the metadata were randomly shuffled, the order of the examples within a class changed. In other words, the first example with label 0 was not the same after shuffling. Therefore, the order of the examples in the resulting Kover dataset was different, resulting in a different random train/test split and thus, slightly different metrics.
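The effect can be reproduced with a small sketch. It assumes a stable grouping by class and a split that picks examples by position with a seeded shuffle; the functions `group_by_class` and `split_train` are hypothetical stand-ins for this example, not Kover's own routines.

```python
import random

# Hypothetical illustration only; these functions are NOT part of Kover.

def group_by_class(metadata_rows):
    """Group examples by label (0 first, then 1), keeping metadata order within a class."""
    # sorted() is stable, so ties keep their original relative order.
    return [sample_id for sample_id, _ in sorted(metadata_rows, key=lambda row: row[1])]


def split_train(ordered_ids, train_fraction, seed):
    """Pick training examples by position, using a seeded shuffle of the indices."""
    indices = list(range(len(ordered_ids)))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * len(ordered_ids)))
    return [ordered_ids[i] for i in indices[:n_train]]


# Same samples and labels, listed in two different orders.
rows_a = [("t5", 0), ("t1", 1), ("t3", 0), ("t2", 1), ("t4", 0)]
rows_b = [("t4", 0), ("t2", 1), ("t3", 0), ("t1", 1), ("t5", 0)]

order_a = group_by_class(rows_a)  # ['t5', 't3', 't4', 't1', 't2']
order_b = group_by_class(rows_b)  # ['t4', 't3', 't5', 't2', 't1']

# Same seed, hence the same positions are picked, but those positions
# now hold different samples.
train_a = split_train(order_a, 0.666, seed=72)
train_b = split_train(order_b, 0.666, seed=72)
print(train_a)
print(train_b)
```

The seed fixes which positions go into the training set, not which samples, so a different within-class order can put different samples into training and testing even though the seed is unchanged.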

ADD COMMENT • link 7.8 years ago by alexandre.drouin1 • 0

0

I thought so. But when I rearrange the order of the metadata (my metadata came in random order, so I reordered them), the results are consistently different: the same ordering always gives the same results, but different orderings give different results. I tried 4 or 5 versions of the metadata that differ only in their order, and they all give different answers.

I noticed this because I was running simulations, so I knew which k-mers should be picked up. Kover did very well on a real dataset but failed in all of my simulations. Since the metadata in my real dataset was in order, I thought the failures might be related to the random order of the metadata in my simulations. However, even after I reordered them, I still could not recover the k-mers that I used to define the phenotype. Shall I send you my simulated dataset (several MB) to play with?

ADD REPLY • link 7.8 years ago by hfan22 ▴ 40

0

It would definitely help me look into this if you were able to share your data. Would you be able to upload it to a server (e.g.: https://mega.nz/) and share the link?

Also, can you include the kover commands that you are using to create and split the data? Did you set the random seed parameter in the "kover dataset split" command? If not, varying results are to be expected, since the examples in the training and testing set are different each time.

ADD REPLY • link 7.8 years ago by Alexandre Drouin ▴ 90

0

https://mega.nz/#F!n053zISY
No key needed.

kover commands used:
kover dataset create from-tsv --genomic-data kmerMatrix.tsv --phenotype-name "rpoBsimulation" --phenotype-metadata metadata.tsv --output temp.kover
kover dataset split --dataset temp.kover --id temp_split --train-size 0.666 --folds 5 --random-seed 72
kover learn --dataset temp.kover --split temp_split --model-type conjunction disjunction --p 0.1 1.0 10.0 --max-rules 5 --hp-choice cv --n-cpu 10

Yes, I did set the random seed, and I used the same one in all trials.

Please let me know if you have any problems accessing the data.

Thank you.

ADD REPLY • link 7.8 years ago by hfan22 ▴ 40