I am new to bioinformatics and am learning by working on a project. I collected miRNA sequencing data from TCGA. There is one text file per sample, and each file contains the following columns:
miRNA_ID read_count reads_per_million_miRNA_mapped cross-mapped
The following is a sample of one file's content:
miRNA_ID read_count reads_per_million_miRNA_mapped cross-mapped
hsa-let-7a-1 55243 9869.306676 N
hsa-let-7a-2 110572 19753.97748 Y
hsa-let-7a-3 55555 9925.046293 N
hsa-let-7b 94076 16806.92386 N
hsa-let-7c 11209 2002.517215 Y
hsa-let-7d 1843 329.256778 N
hsa-let-7e 7786 1390.989298 N
hsa-let-7f-1 166 29.656335 N
hsa-let-7f-2 66277 11840.55968 N
hsa-let-7g 4192 748.911782 N
hsa-let-7i 3617 646.186526 N
hsa-mir-1-1 0 0 N
hsa-mir-1-2 266 47.521597 N
... (trimmed)
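For reference, here is a minimal sketch of how a file in this layout could be read in Python. It assumes the columns are separated by whitespace, as they appear above. The function name `parse_mirna_table` and the shortened keys `rpm` and `cross_mapped` are made up for illustration, and the inline `SAMPLE` string stands in for a real file.

```python
import io

# A few rows copied from the sample above, standing in for a real file.
SAMPLE = """miRNA_ID read_count reads_per_million_miRNA_mapped cross-mapped
hsa-let-7a-1 55243 9869.306676 N
hsa-let-7a-2 110572 19753.97748 Y
hsa-mir-1-1 0 0 N
"""


def parse_mirna_table(handle):
    """Parse a whitespace-delimited miRNA table into a list of dicts.

    Assumption: the first line is the header shown in the question.
    """
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        record = dict(zip(header, fields))
        rows.append({
            "miRNA_ID": record["miRNA_ID"],
            "read_count": int(record["read_count"]),
            "rpm": float(record["reads_per_million_miRNA_mapped"]),
            "cross_mapped": record["cross-mapped"] == "Y",
        })
    return rows


rows = parse_mirna_table(io.StringIO(SAMPLE))
```

With a real file, `open(path)` would replace the `io.StringIO(SAMPLE)` handle.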
How should I preprocess the data? I am not sure how to bring the read counts into the range 0 to 1 for classification. Should I rescale the values as follows?
$$\text{value}(i) = \frac{\text{value}_i - \text{value}_{\min}}{\text{value}_{\max} - \text{value}_{\min}}$$
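To make the mapping concrete, here is a minimal sketch of that min-max rescaling in plain Python. The helper name `min_max_scale` is hypothetical, and returning zeros when all values are equal is an assumption made only to avoid dividing by zero. The input list reuses a few read counts from the sample above purely as illustration. Whether the scaling should run over one miRNA across samples or over all miRNAs within one sample is a separate choice that this sketch does not settle.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant input carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative input: a handful of read counts taken from the sample file.
read_counts = [55243, 110572, 55555, 94076, 0]
scaled = min_max_scale(read_counts)
```

Note that a single very large count compresses every other value toward 0 under this mapping.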
Which of the columns is better suited for machine learning: reads per million or read count?
Thanks in advance.
P.S. I am not asking without having tried first: I tried a number of things and the results did not come out as expected, so this question is not without effort :p
What is the biological question you are trying to answer?