Normalizing data from GEO dataset
3.9 years ago

Hi everyone. I want to use a miRNA-seq dataset from GEO (GSE28884) as a validation dataset, after a feature selection process run on the TCGA miRNA expression profiling dataset (I have to test an SVM classifier). For the feature selection procedure I downloaded the count data and then applied the vst function from DESeq2.

I looked through the paper associated with this GEO dataset (Supplementary Tables 1-8C, sheet S4D-mature miR), and the "Results" section states that "miRNA profiles of a sample can be presented as relative percentage of miRNA read frequencies (rf) by dividing miRNA read counts by total miRNA reads per library." Indeed, if you look at the data, the values are decimals. I would now like to recover the original counts so that I can apply vst and stabilize the variance on this dataset as well, but for that I need the total number of reads per library, and I could not find it.
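To make the relationship concrete, here is a minimal base R sketch of what I understand the paper's description to mean. The count matrix and the names `counts`, `lib_totals`, `rf` and `recovered` are made up for illustration, and I am assuming the values are plain fractions; if sheet S4D actually stores percentages, they would need dividing by 100 first.

```r
# Toy count matrix: rows are miRNAs, columns are libraries (made-up numbers).
counts <- matrix(c(120, 30, 50,
                   400, 100, 500),
                 nrow = 3,
                 dimnames = list(c("miR-a", "miR-b", "miR-c"),
                                 c("lib1", "lib2")))

# Total miRNA reads per library.
lib_totals <- colSums(counts)

# Relative read frequency: each count divided by its library total.
rf <- sweep(counts, 2, lib_totals, "/")

# Going back to counts needs the per-library totals; they cannot be
# inferred from rf alone, because every column of rf sums to 1.
recovered <- round(sweep(rf, 2, lib_totals, "*"))
```

So the missing piece really is one total per library (one number per column), not a single number for the whole study.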

However, in "Material and Methods" (Small RNA sequencing and bioinformatics analysis) the authors explain: "Setting a threshold of 10 or more reads per miRNA for the pool of all 49,479,978 miRNA sequence reads (179 patient samples and 6 cell lines), we identified a total of 888 mature and miRNA star sequences (representing the 2 strands obtained after miRNA precursor processing)." Could this be the total number that I need?
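As far as I can tell, that figure is pooled over all 185 libraries, so on its own it would only give an average library size. A small sketch of the arithmetic, with a toy example (made-up numbers, hypothetical names) of what would be lost if the average were used in place of each library's own total:

```r
# Figures quoted from the paper's methods section.
pooled_reads <- 49479978   # all miRNA sequence reads, pooled
n_libraries  <- 179 + 6    # patient samples plus cell lines

# The pooled figure only yields the average library size.
mean_per_lib <- pooled_reads / n_libraries

# Toy illustration (made-up numbers): two libraries with different depths.
true_totals <- c(lib1 = 200, lib2 = 1000)
rf          <- c(lib1 = 0.25, lib2 = 0.25)   # same relative frequency in both

true_counts  <- rf * true_totals        # scaled by each library's own total
naive_counts <- rf * mean(true_totals)  # scaled by the average: depth differences are lost
```

If this reasoning is right, multiplying by the average would give pseudo-counts on a common scale rather than the original counts.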

In short, I can't work out what kind of normalization was applied to this miRNA expression profile (sheet S4D), and vst doesn't work on decimal values, so I don't know how to stabilize the variance of these data.

Thank you very much. I'm a complete beginner in bioinformatics, and any suggestion would be really appreciated.

normalization R RNA-Seq vst featureselection • 815 views