I have an experimental design in which I received count data for different treatment conditions after transfection with shRNAs. The counts for each shRNA should serve as a proxy for whether the gene targeted by that shRNA is associated with preferential survival or death under a given treatment condition.
The obstacle, however, is that the library contains multiple shRNAs for every gene. In some respects, this is similar to the issue in a previous post I made about PantherDB (to which I unfortunately received no replies or insights). The edgeR documentation (page 26) indicates that the row names should be the gene names in order for the program to conduct GO/pathway analysis. This requirement is a problem for me because row names cannot be duplicated, yet a gene may have more than one associated row.
As of right now, the row names of my dataframe are the gene names, followed by a numeric suffix when a gene has more than one row. For example, the following are all row names:
C1QL4, C1QL4_2, C1QL4_3, etc.
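To make the naming scheme concrete, here is a minimal Python sketch (not edgeR code) of how these row names map back to gene symbols. It assumes that a trailing underscore plus digits always marks an additional shRNA and that no real gene symbol in the library ends in that pattern; the function names are made up for illustration.

```python
import re

# Assumption: a trailing "_<digits>" marks the 2nd, 3rd, ... shRNA for a gene,
# and no real gene symbol in the library ends in that pattern.
_SUFFIX = re.compile(r"_\d+$")


def gene_symbol(row_name):
    """Strip the trailing shRNA index, e.g. 'C1QL4_2' -> 'C1QL4'."""
    return _SUFFIX.sub("", row_name)


def group_rows_by_gene(row_names):
    """Map each gene symbol to the list of row names (shRNAs) targeting it."""
    groups = {}
    for name in row_names:
        groups.setdefault(gene_symbol(name), []).append(name)
    return groups


rows = ["C1QL4", "C1QL4_2", "C1QL4_3"]
groups = group_rows_by_gene(rows)
print(groups)
```

The point is only that the gene-level identity is recoverable from the row names, so the duplicates are a labelling issue rather than missing information.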
Is there a particular way in which I could conduct GO analysis given this constraint? I was considering selecting a single 'row' for each gene based on p-value and log2 fold change, but I do not know whether this is considered a valid approach.
If the above is not a valid approach:
- How can I account for the fact that there are multiple 'data points' for each gene with respect to p-value and log2 fold change?
If the above is a valid approach:
- How exactly can I determine which 'row' for each gene should serve as my 'exemplar'? Is there scientific validity to choosing the row with the largest log2 fold change after some p-value thresholding? Or is that perhaps cherry-picking data?
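
For clarity, the exemplar selection I have in mind could be sketched roughly as below. This is plain Python for illustration only, not something taken from edgeR; the function name, the 0.05 cutoff, and the numbers are all made up, and I am not claiming the approach is statistically sound (that is exactly what I am asking).

```python
def pick_exemplars(results, p_cutoff=0.05):
    """Pick one row per gene: the largest |log2 fold change| among rows
    whose p-value passes the cutoff.

    results: iterable of (row_name, gene, log2fc, p_value) tuples.
    Genes with no row passing the cutoff are left out of the result.
    """
    best = {}
    for row in results:
        _name, gene, log2fc, p_value = row
        if p_value > p_cutoff:
            continue  # row fails the p-value threshold
        if gene not in best or abs(log2fc) > abs(best[gene][2]):
            best[gene] = row
    return best


# Hypothetical per-shRNA results (values invented for illustration).
results = [
    ("C1QL4", "C1QL4", 0.4, 0.20),
    ("C1QL4_2", "C1QL4", -1.8, 0.01),
    ("C1QL4_3", "C1QL4", 1.1, 0.03),
]
exemplars = pick_exemplars(results)
print(exemplars["C1QL4"][0])
```

With these invented values the second shRNA would be kept, even though the third one points in the opposite direction, which is part of why I worry about cherry-picking.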