Hi
I would be very grateful for some help.
I am trying to use WGCNA to perform network analysis on TCGA RNASeq data.
I am at the Data Input and Cleaning stage. After using clustering to exclude outlying samples, I am having difficulty getting my clinical trait data to align and match with my RNASeq data. Computer says no, but I don't understand why. I've been following the R code from https://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-01-dataInput.pdf and also their R code chapter, which I found through Google while trying to troubleshoot this problem, but the code from neither works.
Let me describe my pipeline to date.
R is set up and the working directory is set to where the data files are. I was advised to call disableWGCNAThreads() at the beginning, which I did (by typing "disableWGCNAThreads()"). I loaded the read data as data and the clinical traits as traits. The file of data reads was formatted with all the gene names in the 1st column (A) of my csv file and the TCGA patient IDs in the top row, so each patient's data is in a column, as advocated by the WGCNA handbook. The clinical trait data was formatted in the same way.
I transposed the datafile:
datExpr0=as.data.frame(t(data[, -c(1)]))
names(datExpr0) = data$GeneID
rownames(datExpr0) = names(data)[-c(1)]
I then checked for too many missing values with gsg:
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
No missing values. So far so good.
I made my sample cluster tree, applied my cut-off and kept the remaining cluster (removing 10 of my 41 samples). I told R to keep these:
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
Loaded trait data. This had 16 fields for the original 42 samples.
dim(traits)
[1] 16 42
There were no extra data columns that needed removing so formatting looked like this with no -c() command:
allTraits = traits[,]
allTraits = allTraits[, c(1, 2:42) ]
dim(allTraits)
[1] 16 42
But now I reach the point where I am meant to create a data frame for the clinical trait data, parallel to the clustered data file, that will only contain data for the 31 patients I've kept after clustering, and this is where it goes wrong:
# Form a data frame analogous to expression data that will hold the clinical traits.
Samples = rownames(datExpr);
traitRows = match(Samples, allTraits$Trait);
datTraits = allTraits[traitRows, -1];
rownames(datTraits) = allTraits[traitRows, 1];
R can't do it and it says:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names':
It seems to think I'm working on creating the same data frame rather than 2? How has it got this impression from the code? I tried re-enabling the WGCNA threads, in case I had removed its capacity to create 2 data frames simultaneously, by removing the files, typing enableWGCNAThreads() and repeating the above, but it did not work.
Can anyone in the world help me with this?
Thank you very much Jude
I believe the error is just that you are attempting to set a vector of non-unique values as the rownames to a data-frame (datTraits), which is not permitted in R (note that it is permitted for a data-matrix).
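To illustrate, here is a minimal sketch that reproduces the same error on toy data. It assumes, as the question describes, that the trait table has the trait names in its first column and one column per sample; the table contents and the sample names S1 and S2 are made up for illustration. In that layout the Trait column contains no sample IDs, so match() returns only NA, and a vector of NAs counts as duplicated row names. That would also be consistent with the warning in the question listing no values after the colon.

```r
# Toy stand-in for the trait table; all names and values are made up.
allTraits <- data.frame(Trait = c("age", "stage"),
                        S1 = c(61, 2),
                        S2 = c(55, 3),
                        stringsAsFactors = FALSE)
samples <- c("S1", "S2")

# The Trait column holds trait names, not sample IDs, so nothing matches.
traitRows <- match(samples, allTraits$Trait)
print(traitRows)

datTraits <- allTraits[traitRows, -1]

# Assigning an all-NA vector as row names is rejected as duplicated row names.
res <- tryCatch({
  rownames(datTraits) <- allTraits[traitRows, 1]
  "ok"
}, error = function(e) conditionMessage(e))
print(res)
```

Checking anyNA(traitRows) straight after the match() call is a cheap way to catch this before the row names are assigned.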
What is the output of
allTraits$Trait
? I believe this should merely be a vector of sample names / IDs that match those in the expression matrix used for network construction.
That doesn't work either; all I have is a column of numbers from 1 to 31 (for the number of samples) and an empty column next to it that says NA in it. I do not have sufficient coding knowledge to know how to fix this myself. I am perplexed that the R code in the WGCNA handbook would be so redundant :(
That's a major issue generally, in bioinformatics.
Can you please paste the output of
rownames(datExpr)
and
allTraits$Trait
?
I'm having a similar issue. Were you able to resolve this?
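In case it helps anyone landing here, below is a hedged sketch of one possible way to line the traits up. It is not a confirmed fix from this thread. It assumes the trait file really is laid out as described in the question (trait names in the first column, one column per sample), so it is transposed in the same way as the expression data and the samples are then matched on row names. The toy data, sample IDs and gene names are all made up.

```r
# Toy data standing in for the real files; every name below is illustrative.
traits <- data.frame(Trait = c("age", "stage"),
                     TCGA.01 = c(61, 2),
                     TCGA.02 = c(55, 3),
                     TCGA.03 = c(70, 1),
                     stringsAsFactors = FALSE)
datExpr <- data.frame(geneA = c(5.1, 4.8),
                      geneB = c(2.2, 2.9),
                      row.names = c("TCGA.03", "TCGA.01"))

# Transpose so that samples become rows and traits become columns,
# mirroring what was done for the expression data.
# Note: if any trait is text, t() turns every column into character.
allTraits <- as.data.frame(t(traits[, -1]))
names(allTraits) <- traits$Trait

# Align the trait rows to the expression samples by sample ID.
traitRows <- match(rownames(datExpr), rownames(allTraits))
stopifnot(!anyNA(traitRows))  # stop early if any sample ID has no match
datTraits <- allTraits[traitRows, , drop = FALSE]

print(datTraits)
```

If the match still returns NA on real data, it is worth comparing rownames(datExpr) with the trait sample IDs character by character. For example, read.csv() with its default check.names = TRUE rewrites hyphens in column names as dots, so IDs can differ if the two files were read in or edited differently.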