Question

deleting TCGA replicated samples

5.3 years ago

Hi everyone, I'm working on TCGA data. I want to have unique samples, but there are replicates among my samples and I don't know how to handle them. I'm not sure whether taking the median of the replicated samples is appropriate, because for solid tumors the two samples being combined might have completely different spatial heterogeneity.

RNA-Seq TCGA aggregate • 2.5k views

ADD COMMENT • link 5.3 years ago by . • 0

Thanks for answering. But because I want to integrate my data with protein data, I have to use only part of the TCGA barcode (the third field, which identifies the participant). For example, in TCGA-02-0001-01C-01D-0182-01, 0001 is the participant field that I need.
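To make the barcode layout concrete, here is a minimal base R sketch that splits an aliquot barcode into its dash-separated fields. The helper name `parse_tcga_barcode` is made up for illustration. The field meanings follow the standard TCGA barcode layout, and the first 12 characters (project, tissue source site, participant) are what is commonly used as the patient-level key.

```r
# Hypothetical helper: split a TCGA aliquot barcode into its fields.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  list(
    project     = parts[1],                # "TCGA"
    tss         = parts[2],                # tissue source site
    participant = parts[3],                # the field discussed above
    sample_type = substr(parts[4], 1, 2),  # e.g. "01" = primary solid tumor
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),  # "R" = RNA, "D" = DNA
    plate       = parts[6],
    center      = parts[7]
  )
}

barcode <- "TCGA-02-0001-01C-01D-0182-01"
fields <- parse_tcga_barcode(barcode)
fields$participant

# Patient-level key: project, tissue source site and participant together.
patient_id <- substr(barcode, 1, 12)
patient_id
```

The participant field alone may not be enough to match across data sets, so keying on the 12-character prefix is probably the safer choice.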

ADD REPLY • link 5.3 years ago by . • 0

Specifically what data are you working on? Where do you get the data from? Could you post an example of a duplicated sample id?

ADD REPLY • link 5.3 years ago by Kristoffer Vitting-Seerup ★ 4.2k

score 0 · Answer 1 · 2020-01-03

Don't take the median; you need to select one of them. You may need the full TCGA barcode, or other information such as is_ffpe, to help you select a single sample.

This page explains the TCGA barcode. You need to download the other related files, such as _MANIFEST.txt_ and the _metadata file_, where you can find more information about your samples/data.
Example of a full barcode from the metadata:

  "associated_entities": [
    {
      "entity_id": "90e6e8a1-98b3-4f38-92ef-df460d78d657", 
      "case_id": "ada19f65-5256-4c79-b3b9-7b9da69be437", 
      "entity_submitter_id": "TCGA-E7-A97Q-01A-11R-A38B-07", 
      "entity_type": "aliquot"
    }
  ],
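Once the full barcodes (the `entity_submitter_id` values) are available, picking one aliquot per patient could look like the sketch below. The selection rule is an assumption for illustration, not an official TCGA rule: keep only primary solid tumor aliquots (sample type `01`), and when several remain for a patient, keep the one whose barcode sorts last. The helper name `pick_one_per_patient` and the second `E7-A97Q` aliquot are invented for the example.

```r
# Hypothetical helper: keep a single primary-tumor aliquot per patient.
pick_one_per_patient <- function(barcodes) {
  # Characters 14-15 of the barcode hold the sample type code.
  tumor <- barcodes[substr(barcodes, 14, 15) == "01"]
  # Sort in decreasing order so the "largest" barcode of each patient comes first.
  tumor <- sort(tumor, decreasing = TRUE, method = "radix")
  # Characters 1-12 identify the patient; keep the first barcode seen per patient.
  tumor[!duplicated(substr(tumor, 1, 12))]
}

barcodes <- c(
  "TCGA-HZ-A9TJ-01A-11R-A41I-07",
  "TCGA-HZ-A9TJ-06A-11R-A41B-07",
  "TCGA-H6-A45N-01A-11R-A26U-07",
  "TCGA-H6-A45N-11A-12R-A26U-07",
  "TCGA-E7-A97Q-01A-11R-A38B-07",
  "TCGA-E7-A97Q-01A-12R-A38B-07"  # invented second aliquot, for illustration only
)

picked <- pick_one_per_patient(barcodes)
picked
```

Whatever rule is used, extra metadata such as is_ffpe can be added as a further filter before the de-duplication step.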

score 0 · Answer 2 · 2020-01-03

Hi. I got the mRNA data (the FPKM data) from TCGA using R code, and the protein data from TCPA. Examples of my duplicated samples are below:

TCGA-HZ-A9TJ-01A-11R-A41I-07
TCGA-HZ-A9TJ-06A-11R-A41B-07

TCGA-H6-A45N-01A-11R-A26U-07
TCGA-H6-A45N-11A-12R-A26U-07
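The pairs above share a participant but differ in the sample field (01A vs 06A, and 01A vs 11A). In the TCGA sample type codes, 01 is a primary solid tumor, 06 is metastatic and 11 is solid tissue normal, so these look like different sample types from the same patient rather than technical replicates. A small base R sketch to tabulate this (only the three codes seen here are mapped; the full code table is longer):

```r
# Labels for the sample type codes that appear in the example barcodes.
sample_type_labels <- c(
  "01" = "Primary Solid Tumor",
  "06" = "Metastatic",
  "11" = "Solid Tissue Normal"
)

barcodes <- c(
  "TCGA-HZ-A9TJ-01A-11R-A41I-07",
  "TCGA-HZ-A9TJ-06A-11R-A41B-07",
  "TCGA-H6-A45N-01A-11R-A26U-07",
  "TCGA-H6-A45N-11A-12R-A26U-07"
)

info <- data.frame(
  barcode     = barcodes,
  patient     = substr(barcodes, 1, 12),                                # TCGA-TSS-participant
  sample_type = unname(sample_type_labels[substr(barcodes, 14, 15)]),  # decoded type
  stringsAsFactors = FALSE
)

print(info)

# One row per patient, one column per sample type.
table(info$patient, info$sample_type)
```

If only tumor tissue is wanted, filtering on the sample type code first may already resolve most of the apparent duplicates.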

The R code I used to get the data is below:

library(TCGAbiolinks)
library(dplyr)
library(DT)
library(SummarizedExperiment)

# 1

query1 <- GDCquery(project = "TCGA-PAAD",
                   data.category = "Transcriptome Profiling",
                   data.type = "Gene Expression Quantification",
                   workflow.type = "HTSeq - Counts")

# workflow.type = "HTSeq - FPKM-UQ"

# The files for query1 must be downloaded before GDCprepare can read them
GDCdownload(query1)

df <- GDCprepare(query1, save = TRUE, save.filename = "TCGA-PAAD_dataframe.rda", summarizedExperiment = FALSE)
write.csv(df, file = "count.csv")

# 2

query <- GDCquery(project = "TCGA-PAAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")

# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
PAADRnaseqSE <- GDCprepare(query)

PAADMatrix <- assay(PAADRnaseqSE, "HTSeq - Counts") # or PAADMatrix <- assay(PAADRnaseqSE, "raw_count")

# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
PAADRnaseq_CorOutliers <- TCGAanalyze_Preprocessing(PAADRnaseqSE)

# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = PAADRnaseq_CorOutliers, method = "quantile", qnt.cut = 0.25)

# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = c("NT"))

# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt), typesample = c("TP"))

# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[, samplesNT],
                            mat2 = dataFilt[, samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")

# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs, "Tumor", "Normal", dataFilt[, samplesTP], dataFilt[, samplesNT])
write.csv(dataDEGsFiltLevel, file = "DEGs.csv")
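For the integration step with the protein data, one possible approach is to keep only primary tumor columns of the expression matrix, rename them to the 12-character patient ID, and then intersect with the protein sample IDs. The sketch below uses a toy matrix in place of the real expression matrix, and `protein_ids` is a hypothetical vector: it assumes the protein table can also be keyed by the 12-character patient ID, which should be checked against the actual TCPA file.

```r
# Toy expression matrix standing in for the real one (genes x barcodes).
expr <- matrix(
  1:8, nrow = 2,
  dimnames = list(
    c("geneA", "geneB"),
    c("TCGA-HZ-A9TJ-01A-11R-A41I-07", "TCGA-HZ-A9TJ-06A-11R-A41B-07",
      "TCGA-H6-A45N-01A-11R-A26U-07", "TCGA-H6-A45N-11A-12R-A26U-07")
  )
)

# Keep primary solid tumor columns only (sample type code "01").
is_tumor <- substr(colnames(expr), 14, 15) == "01"
expr_tp <- expr[, is_tumor, drop = FALSE]

# Rename columns to the patient-level ID and make sure none is repeated.
colnames(expr_tp) <- substr(colnames(expr_tp), 1, 12)
stopifnot(anyDuplicated(colnames(expr_tp)) == 0)

# Hypothetical patient IDs from the protein data; the second one is a placeholder.
protein_ids <- c("TCGA-HZ-A9TJ", "TCGA-XX-0000")

# Keep only patients present in both data sets.
shared <- intersect(colnames(expr_tp), protein_ids)
expr_matched <- expr_tp[, shared, drop = FALSE]
expr_matched
```

If the duplicate check fails on real data, some patients still have more than one primary tumor aliquot and one of them has to be chosen before matching.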