Hi everyone. I want to use a miRNA-seq dataset from GEO (GSE28884) as a validation dataset, after running feature selection on the TCGA miRNA expression profiling dataset (I need to test an SVM classifier).
For the feature selection procedure I downloaded the count data and then applied the vst function from DESeq2.
I went through the paper associated with this GEO dataset (Supplementary tables 1-8C, sheet S4D-mature miR). The "Results" section says that "miRNA profiles of a sample can be presented as relative percentage of miRNA read frequencies (rf) by dividing miRNA read counts by total miRNA reads per library." Indeed, if you look at the data, the values are decimals. I would now like to recover the original counts so that I can run vst and stabilize the variance on this dataset as well, but for that I need the total number of reads per library, and I could not find it.
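To make the back-calculation concrete, here is a minimal sketch of what I have in mind (in Python, just for illustration). It assumes the values in sheet S4D are percentages of each library's total miRNA reads and that the per-library total is known. The function name, the miRNA values, and the library total below are all hypothetical and are not taken from the paper.

```python
def counts_from_read_frequencies(read_freqs, total_reads, as_percent=True):
    """Approximate integer counts from relative read frequencies.

    read_freqs  : dict mapping miRNA name -> read frequency in one library
    total_reads : total miRNA reads in that library (the number I am missing)
    as_percent  : assumption that frequencies are percentages (0-100),
                  not fractions (0-1)
    """
    scale = 100.0 if as_percent else 1.0
    # rf = count / total  =>  count ~= rf * total (rounded, so only approximate)
    return {mirna: round(rf / scale * total_reads)
            for mirna, rf in read_freqs.items()}


# Hypothetical library: made-up frequencies and a made-up library size
library_rf = {"miR-21": 25.0, "miR-155": 0.5, "miR-10b": 74.5}
library_total = 200_000

counts = counts_from_read_frequencies(library_rf, library_total)
print(counts)
```

The recovered counts would only be approximate because of rounding in the published frequencies, and the whole idea depends on having a separate total for every library.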
However, "Material and methods" (Small RNA sequencing and bioinformatics analysis) states: "Setting a threshold of 10 or more reads per miRNA for the pool of all 49,479,978 miRNA sequence reads (179 patient samples and 6 cell lines), we identified a total of 888 mature and miRNA star sequences (representing the 2 strands obtained after miRNA precursor processing)." Could this be the total I need, even though it appears to be pooled across all samples rather than given per library?
In brief, I can't work out what kind of normalization was applied to this miRNA expression profile (sheet S4D), and vst doesn't work on decimals, so I don't know how to stabilize the variance of this data.
Thank you very much. I'm a complete beginner in bioinformatics, and any suggestion is really appreciated.