TCGA CNV SNP 6.0 tumor data files: more than one set for the same submitter ID
6.9 years ago
ncafung ▴ 30

I am conducting CNV analysis on TCGA Level 3 SNP 6.0 data. For a few of the downloaded tumor samples, I found more than one seg file associated with the same TCGA submitter ID. For example:

TCGA-44-2656-01A  CUTCH_p_TCGAb_355_37_52_NSP_GenomeWideSNP_6_H10_1376764.nocnv_grch38.seg
TCGA-44-2656-01A  EGGAR_p_TCGAb33and37_SNP_N_GenomeWideSNP_6_H04_585228.nocnv_grch38.seg
TCGA-44-2656-01A  HILLY_p_TCGA_b90_wRedos_SNP_N_GenomeWideSNP_6_A04_748062.nocnv_grch38.seg

The first file has 415 rows, while the second and third files have 197 and 271 rows, respectively. All three files report a Mean Seg Score for chromosomes 1 to 22 and X.

In this situation, what factors should I consider when selecting one of the three files for my downstream analysis?
If I decide to combine the seg data from the three files instead, should I merge the chromosome regions that overlap fully or partially among them?

The UUIDs of the three files above, in the same order, are:

89327245-3da1-4a96-bee3-5b84ae43401a
f19650bb-8ead-490b-9f91-d7c4b06bfe6b
0e6071db-c44c-4958-ab95-087d44620893

Is there a way to find out more about the three files with the above UUIDs to help with the file selection?

TCGA CNV SNP • 2.1k views
6.9 years ago

Yes, digging through these can be challenging and I do not have a direct answer for you without looking at other data. You might try this code to help determine what is "under the hood" for these file IDs. In R:

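# Install these packages first if you do not already have them
library(GenomicDataCommons)
library(listviewer)
library(magrittr)  # provides the %>% pipe
# Query the GDC files endpoint for the three file UUIDs,
# requesting every available metadata field
res = files() %>%
  filter(~ file_id %in% c("89327245-3da1-4a96-bee3-5b84ae43401a",
    "f19650bb-8ead-490b-9f91-d7c4b06bfe6b",
    "0e6071db-c44c-4958-ab95-087d44620893")) %>%
  select(available_fields('files')) %>%
  results_all()
# Open an interactive viewer on the nested result
jsonedit(res)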

This will open a viewer where you can interactively investigate the (very deeply) nested details of each file's metadata. In RStudio, this looks like:

[Screenshot: the jsonedit viewer in RStudio]

You'll notice that these files are all derived from primary tumor, but from different aliquots. The histopath description for each associated slide implies that the sample is 85% tumor. As with any copy number data, you might need to go back further in the pipeline to define quality metrics that could help.
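
As a rough first comparison of the three seg files, something like the following base R sketch could help. It is only a minimal illustration, not a substitute for proper quality metrics. It assumes the seg files are tab-delimited with columns named Start, End, Num_Probes and Segment_Mean (check your own headers), and both the helper name `summarize_seg` and the 0.2 cutoff are hypothetical choices made for this example.

```r
# Hypothetical helper: summarize one segment table.
# Assumes columns Start, End, Num_Probes and Segment_Mean.
summarize_seg <- function(seg, threshold = 0.2) {
  seg_length <- seg$End - seg$Start + 1
  # Segments whose absolute segment mean exceeds the (arbitrary) cutoff
  altered <- abs(seg$Segment_Mean) > threshold
  data.frame(
    n_segments   = nrow(seg),
    n_probes     = sum(seg$Num_Probes),
    # Fraction of the covered length that lies in "altered" segments
    frac_altered = sum(seg_length[altered]) / sum(seg_length)
  )
}

# Toy input standing in for one seg file (values are made up)
toy <- data.frame(
  Chromosome   = c("1", "1", "2"),
  Start        = c(1, 1001, 1),
  End          = c(1000, 2000, 2000),
  Num_Probes   = c(10, 12, 25),
  Segment_Mean = c(0.01, 0.65, -0.30)
)

toy_summary <- summarize_seg(toy)
print(toy_summary)
```

For real data, each file could be read with `read.delim("<file>.seg", stringsAsFactors = FALSE)` and passed to the same helper, so that the three aliquots can be compared side by side. A file with many more segments than the others, like the 415-row file here, may be noisier, but that is only a hint and not a rule.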


Thanks! I got the same output as you showed above. As a newbie, I would like to share that you will need the following libraries before you can run the R script above.

library(devtools)

Then install and load GenomeInfoDb:

source('https://bioconductor.org/biocLite.R')
biocLite("GenomeInfoDb")
library(GenomeInfoDb)

biocLite('Bioconductor/GenomicDataCommons')

# if it does not install the first time, force a reinstall

biocLite('Bioconductor/GenomicDataCommons', force = TRUE)

# check whether the package installed properly

GenomicDataCommons::status()

library(GenomicDataCommons)

Then load the following two libraries:

library(listviewer)
library(magrittr)

Then run the R script provided by Sean above.

You can find more up-to-date information on GenomicDataCommons at the URL below:

https://github.com/Bioconductor/GenomicDataCommons


Thanks for the great details. Minor adjustment--GenomicDataCommons is now in Bioconductor, so biocLite('GenomicDataCommons') is the preferred approach for installing.


One more detail...

You will need R version 3.4 or later for GenomicDataCommons to install properly.

I had version 3.3.3 originally, and the initial installation attempt failed.
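
To avoid a failed install, the R version can be checked up front. The snippet below is a minimal sketch: the helper name `meets_min_r` is made up for this example, and the 3.4.0 minimum comes from the note above. The commented lines at the end reflect that, on current Bioconductor releases, BiocManager has replaced biocLite as the installer.

```r
# Hypothetical helper: TRUE if `current` is at least `min_version`.
meets_min_r <- function(min_version = "3.4.0",
                        current = as.character(getRversion())) {
  # compareVersion() returns -1, 0 or 1 for older, equal or newer
  utils::compareVersion(current, min_version) >= 0
}

if (!meets_min_r()) {
  message("R ", getRversion(), " is older than 3.4; upgrade R before installing.")
}

# On current Bioconductor releases, BiocManager replaces biocLite:
# install.packages("BiocManager")
# BiocManager::install("GenomicDataCommons")
```

Calling `meets_min_r("3.4.0", "3.3.3")` returns FALSE, which matches the failed installation described above.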
