Question

Why does "Parse XML clinical data" in TCGAbiolinks give an error in RStudio?

7.1 years ago

Björn ▴ 110

The following commands for TCGA clinical data give an error in TCGAbiolinks. I am trying to parse the XML clinical data:

query <- GDCquery(project = "TCGA-PRAD",
                  data.category = "Clinical")

radiation <- GDCprepare_clinic(query, clinical.info = "radiation")

|====                | 4%
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
  Start tag expected, '<' not found [4]

I assume it is because of the large file size. How can I solve the problem?

tcgabiolinks xml clinical data • 3.7k views

ADD COMMENT • link updated 7.1 years ago by afsverissimo • 0 • written 7.1 years ago by Björn ▴ 110

tag expected, '<' not found [4]

That's because the file is not an XML file, or the file is corrupted.

Test this with:

xmllint --stream --noout your.xml

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

Thanks. I have nearly 600 individual XML files in individual folders, which is the standard download layout from TCGA, so I'm not sure how to check them all.

ADD REPLY • link 7.1 years ago by Björn ▴ 110

find /path/to/dir  -type f -name "*.xml" -exec xmllint --stream --noout '{}' ';'
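
With roughly 600 files it can help to print only the files that fail. The following is a minimal sketch, not part of the original reply: `check_xml_dir` is a hypothetical helper name, it uses only `--noout`, and it relies on `xmllint` returning a non-zero exit status for a file it cannot parse.

```shell
# check_xml_dir is a hypothetical helper; $1 is the directory holding the download.
# It prints the path of every *.xml file that xmllint fails to parse and
# prints nothing for well-formed files.
check_xml_dir() {
    find "$1" -type f -name '*.xml' -exec sh -c '
        for f in "$@"; do
            # Discard the parser messages; keep only the failing path.
            xmllint --noout "$f" 2>/dev/null || printf "%s\n" "$f"
        done
    ' sh {} +
}
```

Usage: `check_xml_dir /path/to/dir`. An empty result means every `*.xml` file parsed, which points to the problem being elsewhere, for example a non-XML file being fed to the parser.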

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

I'm facing the same problem, and all of the files are valid XML.

Edit: I'm using the KIRC project instead. Something is odd: there should be 627 patients, but there are only 621 XML files in the clinical directory.

Edit 2: all files listed in the query results exist, so there must be multiple references to the same file.

ADD REPLY • link 7.1 years ago by afsverissimo • 0

Actually, the code is trying to read a txt file as XML -_-

I'm guessing this is a problem with TCGAbiolinks either trying to parse the wrong files or not recognizing that a file is not XML.

find /path/to/dir -type f -name "*.txt"
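
The error message says the parser did not find `<` at the start of the input, so a cheap check is to look at the first non-whitespace character of every file, whatever its extension. This is a rough sketch under assumptions: `list_non_xml` is a hypothetical helper name, `head -c` is assumed to be available, and a file that starts with a byte-order mark would be reported even though it may be valid XML.

```shell
# list_non_xml is a hypothetical helper; $1 is the directory holding the download.
# It prints every file whose first non-whitespace character is not '<',
# i.e. files that would trigger "Start tag expected, '<' not found".
# Assumes file names do not contain newlines.
list_non_xml() {
    find "$1" -type f | while IFS= read -r f; do
        # First non-whitespace byte of the file (empty for a blank file).
        first=$(tr -d ' \t\r\n' < "$f" | head -c 1)
        [ "$first" = "<" ] || printf '%s\n' "$f"
    done
}
```

Usage: `list_non_xml /path/to/dir`. Any path it prints is a file that cannot be parsed as XML and should be excluded from the query.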

ADD REPLY • link 7.1 years ago by afsverissimo • 0

score 0 · Answer 1 · 2018-06-20

7.1 years ago

afsverissimo • 0

According to the Bioconductor support site, you need to restrict the query to XML files by adding file.type = "xml":

query <- GDCquery(project = "TCGA-KIRC", data.category = "Clinical", file.type = "xml")

Here is the link: https://support.bioconductor.org/p/110056/#110057

I haven't been able to confirm this because TCGAbiolinks reports that the GDC server is down.

Edit: the solution works for me. I solved the server-down problem by using the GitHub version of the package.

ADD COMMENT • link 7.1 years ago by afsverissimo • 0

Install the latest version from GitHub. See "A: GDC server down???"

ADD REPLY • link 7.1 years ago by GenoMax 152k

Thanks a lot, it solved the problem!

ADD REPLY • link 7.1 years ago by afsverissimo • 0