7.0 years ago
lur_murad
I am trying to download the van't Veer et al. (2002) breast cancer dataset using the breastCancerNKI package from Bioconductor. This is my code:
library("breastCancerNKI")
library("affy")
library("Hmisc")
# Load the data
data(nki)
data <- exprs(nki)
dim(data)
# Annotate with gene symbol
genesymbol <- nki@featureData@data$HUGO.gene.symbol
row.names(data) <- genesymbol
# Reduce the data set
data <- data[!is.na(row.names(data)), ]
In abstract(nki), I notice that there should be "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, exprs(nki) contains 337 samples, and it is unclear which of them are the 144 + 151 = 295 patients. Does anyone have the data with the sample labels, so that I can determine each sample's phenotype? Thanks in advance.
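As a side note on the row-filtering step in the code above, here is a minimal sketch on a toy matrix (all values and gene symbols are made up for illustration) of why a logical mask with `!is.na()` is a safer way to drop unannotated probes than `-which(is.na(...))`: when there are no NA values at all, the `-which()` form selects zero rows instead of keeping everything.

```r
# Toy expression matrix: 4 probes x 2 samples (values are made up)
expr <- matrix(1:8, nrow = 4, dimnames = list(NULL, c("s1", "s2")))
symbols <- c("TP53", NA, "BRCA1", NA)

# Keep only probes that have a gene symbol, using a logical mask
keep <- !is.na(symbols)
filtered <- expr[keep, , drop = FALSE]
rownames(filtered) <- symbols[keep]

# Pitfall: with no NA present, which() returns integer(0),
# and negative indexing with integer(0) selects no rows at all
no_na <- c("A", "B", "C", "D")
dropped_all <- expr[-which(is.na(no_na)), , drop = FALSE]
kept_all <- expr[!is.na(no_na), , drop = FALSE]

nrow(filtered)
nrow(dropped_all)
nrow(kept_all)
```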
Have you considered contacting the authors of the paper?
There is no email address in the paper!
Van't Veer's email address is here http://cancer.ucsf.edu/people/profiles/vantveer_laura.3358
Thank you for your help @shussainather
This issue appears to go all the way back to 2012 and was never fixed: https://github.com/ramhiser/datamicroarray/issues/5
You may also want to follow up on GitHub.
I think that it can be explained by the fact that the dataset includes patients from 2 studies, some of whom may have been used in both [studies]. It's just not clear, though.
On their Bioconductor page, they state that the Van de Vijver dataset was [edit:] 117 patients, which I found through this command:
Perhaps those other 220 are extra samples used in the second study, by van't Veer. However, it doesn't really add up, because in the abstract() they also state that all patients were under 53 years old, but there are patients in their 60s in the dataset.
Thank you, Kevin. I have read more than one article mentioning that the data contain 217 Normal vs 78 Cancer. I tried this:
I got the annotation table, but it is not clear how they label the samples (treatment 0 1 2)!
I think that it is indeed the column #3 ('series') in the pData that is the key. I just found this comment by someone who worked on it (I assume):
[source: http://grokbase.com/p/r/bioconductor/11asxgmh4j/bioc-breastcancernki-patients-question]
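One way to check whether 'series' separates the two studies is to cross-tabulate it against the other phenotype columns. The sketch below is only an illustration: `summarise_by_series` is a hypothetical helper, the toy data frame stands in for pData(nki) with made-up values, and the column names `node` and `treatment` are assumptions (only 'series' and a treatment coding of 0, 1, 2 are mentioned in this thread). Columns that are not present are silently skipped.

```r
# Hypothetical helper: cross-tabulate a grouping column against other columns.
# Column names in `cols` are assumptions; missing ones are skipped.
summarise_by_series <- function(pheno, group_col = "series",
                                cols = c("node", "treatment")) {
  stopifnot(group_col %in% names(pheno))
  present <- intersect(cols, names(pheno))
  lapply(setNames(present, present), function(col) {
    table(pheno[[group_col]], pheno[[col]],
          useNA = "ifany", dnn = c(group_col, col))
  })
}

# Toy stand-in for the phenotype table; values are made up for illustration
toy <- data.frame(
  series = c("A", "A", "B", "B", "B"),
  node = c(0, 1, 0, NA, 1),
  treatment = c(0, 2, 1, 0, 0)
)

tabs <- summarise_by_series(toy)
print(tabs$node)
print(tabs$treatment)
```

With the package loaded, the same helper could be called as summarise_by_series(pData(nki)), assuming those columns exist; comparing the per-series counts with the 151 and 144 figures from the abstract would show whether one series accounts for the 295 patients.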
Sounds a bit suspicious to me. How difficult can it be to just make a study's data available to the public? One can only assume that there are other issues relating to the protection of this data that no-one wants to admit. This has happened before when 'important' data has been published.