The following TCGAbiolinks command for TCGA clinical data gives an error. I am trying to parse the XML clinical data:
query <- GDCquery(project = "TCGA-PRAD",
                  data.category = "Clinical")
radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
|==== | 4% Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Start tag expected, '<' not found [4]
I assume it is because of the large file size. How can I solve the problem?
It's because the file is not an XML file, or the file is corrupted.
Test this with:
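One way to run such a check from a shell is sketched below; it is not necessarily the check originally suggested. The error message says a start tag '<' was expected, so the sketch lists every file whose first non-whitespace character is not '<'. The function name list_not_xml is hypothetical, and the check is only a rough heuristic: it does not validate the XML, and a file that starts with a byte-order mark would be reported even if it is otherwise fine.

```shell
#!/bin/sh
# list_not_xml DIR
# Hypothetical helper: print every regular file under DIR whose first
# non-whitespace character is not '<'. Such files cannot be parsed as XML.
list_not_xml() {
    dir="$1"
    find "$dir" -type f | while IFS= read -r f; do
        # Look at the first 200 bytes only, drop whitespace,
        # and keep the first remaining character.
        first=$(head -c 200 "$f" | tr -d '[:space:]' | cut -c1)
        if [ "$first" != "<" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_not_xml /path/to/dir on the download directory should print nothing if every file at least looks like XML.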
Thanks. I have nearly 600 individual XML files in individual folders, which is the standard download layout from TCGA. So I have no idea how to sort this out.
I'm facing the same problem, and all of my files are valid XML.
Edit: for the KIRC project, something is odd: there should be 627 patients, but there are only 621 XML files in the clinical directory.
Edit 2: all files in the query results exist, so there are multiple references to the same file.
Actually, the code is trying to read a .txt file as XML -_-
I'm guessing this is a problem with TCGAbiolinks trying to parse the wrong files, or not recognizing that a file is not XML.
find /path/to/dir -type f -name "*.txt"
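Building on the find command above, a small sketch that counts the downloaded files by extension may also help when comparing the number of XML files against the expected number of patients. The function name count_by_ext is hypothetical, and /path/to/dir stands for the download directory as in the command above.

```shell
#!/bin/sh
# count_by_ext DIR
# Hypothetical helper: count regular files under DIR, grouped by the text
# after the last dot in the file name (files without a dot are grouped
# under their full name).
count_by_ext() {
    find "$1" -type f |
        sed -e 's|.*/||' -e 's/.*\.//' |  # strip directories, then keep the extension
        sort |
        uniq -c
}
```

Running count_by_ext /path/to/dir prints one line per extension with its file count, so any .txt files sitting next to the .xml files show up immediately.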