TCGA Biospecimens Slides Extraction
9 months ago
jain72744 ▴ 10

Hi,

I want to extract tissue and diagnostic slide images from TCGA, with normal slides, tissue slides and diagnostic slides labelled separately. The data retrieval tools give me images in SVS format, and I cannot find any code to extract images from those files. The images I get are also not labelled by sample type.

I tried using this code in R:

query <- GDCquery(project = "TCGA-CHOL", 
                  data.category = "Biospecimen",
                  data.type = "Slide Image", 
                  data.format = "SVS",
                  experimental.strategy="Tissue Slide",
                  sample.type ="Primary Tumor")

but sample.type gives me one of the following errors:

Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

OR

Error in checkBarcodeDefinition(sample.type) : 
  Primary Solid Tumor was not found. Please select a difinition from the table above

Please explain how I can obtain the three types of images (diagnostic, normal and tissue) for each patient in TCGA-CHOL, how to open them for analysis in R or Python, and which image formats are used.

Thank you.


When I looked for the parameters to use for image data from TCGA-CHOL (as outlined here: https://rdrr.io/bioc/TCGAbiolinks/f/vignettes/query.Rmd), I saw that there is no data.format for Slide Image:

> print(readr::read_csv("https://docs.google.com/spreadsheets/d/1f98kFdj9mxVDc1dv4xTZdx8iWgUiDYO-qiFJINvmTZs/export?format=csv&gid=2046985454",col_types = readr::cols()), n=50)
# A tibble: 24 × 5
   Data.category               Data.type                           `Workflow Type`                  data.format Platform
   <chr>                       <chr>                               <chr>                            <chr>       <chr>
 1 Transcriptome Profiling     Gene Expression Quantification      STAR - Counts                    NA          NA
 2 Transcriptome Profiling     Gene Expression Quantification      CellRanger - 10x Filtered Counts NA          NA
 3 Transcriptome Profiling     Gene Expression Quantification      CellRanger - 10x Raw Counts      NA          NA
 4 Transcriptome Profiling     Single Cell Analysis                NA                               TSV         NA
 5 Transcriptome Profiling     Differential Gene Expression        Seurat - 10x Chromium            NA          NA
 6 Transcriptome Profiling     Isoform Expression Quantification   -                                NA          NA
 7 Transcriptome Profiling     miRNA Expression Quantification     -                                NA          NA
 8 Transcriptome Profiling     Splice Junction Quantification      NA                               NA          NA
 9 Copy number variation       Copy Number Segment                 NA                               NA          NA
10 Copy number variation       Masked Copy Number Segment          NA                               NA          NA
11 Copy number variation       Gene Level Copy Number              NA                               NA          NA
12 Copy number variation       Allele-specific Copy Number Segment NA                               NA          NA
13 Simple Nucleotide Variation Masked Somatic Mutation             NA                               NA          NA
14 Raw Sequencing Data         -                                   NA                               NA          NA
15 Proteome Profiling          Protein Expression Quantification   NA                               NA          NA
16 Biospecimen                 Slide Image                         NA                               NA          NA
17 Biospecimen                 Biospecimen Supplement              NA                               NA          NA
18 Clinical                    -                                   NA                               NA          NA
19 DNA Methylation             Methylation Beta Value              NA                               NA          Illumina Human Methylation 450
20 DNA Methylation             Methylation Beta Value              NA                               NA          Illumina Human Methylation 27
21 DNA Methylation             Methylation Beta Value              NA                               NA          Illumina Methylation Epic
22 DNA Methylation             Masked Intensities                  NA                               NA          Illumina Human Methylation 450
23 DNA Methylation             Masked Intensities                  NA                               NA          Illumina Human Methylation 27
24 DNA Methylation             Masked Intensities                  NA                               NA          Illumina Methylation Epic

So I omitted a few params from your query to get the command to run to completion:

> query <- GDCquery(project = "TCGA-CHOL", data.category = "Biospecimen", data.type = "Slide Image")
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-CHOL
--------------------
oo Filtering results
--------------------
ooo By data.type
----------------
oo Checking data
----------------
ooo Checking if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Checking if there are results for the query
-------------------
o Preparing output
-------------------

> query
       results   project data.category   data.type access experimental.strategy platform sample.type barcode workflow.type
1 c("aca82.... TCGA-CHOL   Biospecimen Slide Image     NA                    NA       NA          NA      NA            NA

Regarding "how do I open them for analysis in R or Python and the image formats used for the same":

This is an entirely different topic. As far as I know, processing SVS images in R is not straightforward; you might need proprietary software from Aperio.


So what I have to do is download all of these SVS files, convert each of them to JPG/PNG/TIF, and then run my analysis on the converted images? Is there no shorter way?


I don't really know. I don't think you can "convert" an SVS image to other formats without losing quite a bit of information content.
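One way to avoid converting a whole slide at once is to read it region by region at full resolution. The sketch below assumes the third-party OpenSlide Python bindings (`openslide-python`, which also need the OpenSlide C library installed). OpenSlide is not mentioned in this thread, so treat it as a suggestion rather than a tested recipe for TCGA-CHOL. The helper names `tile_origins` and `save_tiles` and the tile size are made up for illustration.

```python
import os
from typing import Iterator, Tuple


def tile_origins(width: int, height: int, tile: int) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of each tile covering the image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield x, y


def save_tiles(svs_path: str, out_dir: str, tile: int = 1024) -> int:
    """Write full-resolution PNG tiles to out_dir; return how many were written."""
    # Imported here so the pure helper above works without OpenSlide installed.
    import openslide  # assumption: pip install openslide-python

    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with openslide.OpenSlide(svs_path) as slide:
        width, height = slide.dimensions  # size of level 0 (full resolution)
        for x, y in tile_origins(width, height, tile):
            # Clip the last row/column so tiles stay inside the slide.
            w, h = min(tile, width - x), min(tile, height - y)
            region = slide.read_region((x, y), 0, (w, h))  # RGBA PIL image
            region.convert("RGB").save(os.path.join(out_dir, f"tile_{x}_{y}.png"))
            count += 1
    return count
```

Calling `save_tiles("slide.svs", "tiles")` on a downloaded file would write one PNG per tile. Whole slides are very large, so in practice one would usually filter out background tiles or read a lower-resolution level instead of saving everything.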


I cannot speak for TCGAbiolinks, but for slides in general:

  • SVS files should be opened in Aspera, or there are some Python tools/libraries that machine learning people use.
  • Normal tissue slides are rare or nearly absent in most TCGA projects.
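On the labelling part of the question: TCGA slide file names normally start with the slide barcode, which encodes both the sample type (for example `01` for Primary Solid Tumor, `11` for Solid Tissue Normal) and the slide kind (`DX` for diagnostic slides; `TS`, `MS`, `BS` for top, middle and bottom frozen tissue slides). A minimal sketch that derives labels from a file name follows. The file names in the usage note are invented placeholders, the function name `label_slide` is hypothetical, and only a subset of sample-type codes is listed.

```python
import re
from typing import NamedTuple, Optional

# Subset of TCGA sample-type codes; unknown codes are reported as "code NN".
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}

# DX = diagnostic slide; TS/MS/BS = top/middle/bottom frozen tissue slide.
SLIDE_KINDS = {"DX": "diagnostic", "TS": "tissue", "MS": "tissue", "BS": "tissue"}

BARCODE = re.compile(
    r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})"  # patient barcode
    r"-(\d{2})[A-Z]"                     # sample-type code + vial letter
    r"-\d{2}"                            # portion
    r"-(DX|TS|MS|BS)[A-Z0-9]"            # slide kind + slide index
)


class SlideLabel(NamedTuple):
    patient: str
    sample_type: str
    slide_kind: str


def label_slide(file_name: str) -> Optional[SlideLabel]:
    """Return labels parsed from a slide file name, or None if it does not match."""
    match = BARCODE.match(file_name)
    if match is None:
        return None
    patient, code, kind = match.groups()
    return SlideLabel(patient, SAMPLE_TYPES.get(code, f"code {code}"), SLIDE_KINDS[kind])
```

For a placeholder name such as `TCGA-XX-0000-11A-01-TS1.example.svs`, `label_slide` reports a Solid Tissue Normal tissue slide for patient `TCGA-XX-0000`. Grouping the results by `patient` then shows which of the three image types each patient actually has.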

"SVS files should be opened in Aspera"

Aspera is a file transfer application. Are you referring to Aperio?


Oh, you are right. I had a typo. It should be Aperio ImageScope. By the way, I don't like the SVS format: different companies abuse it by adding their own touches to it, and I am sometimes surprised by how different one SVS file is from another.


I think you're in a vastly better place to help OP. I've briefly tried to process SVS files with R and failed. Maybe you can give them pointers on how to approach the problem?


My colleagues use the Python Pillow package.
