Hi. I got the mRNA data (FPKM) from TCGA using R code, and the protein data from TCPA. An example of my duplicated sample IDs is below:
TCGA-HZ-A9TJ-01A-11R-A41I-07
TCGA-HZ-A9TJ-06A-11R-A41B-07
TCGA-H6-A45N-01A-11R-A26U-07
TCGA-H6-A45N-11A-12R-A26U-07
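To make the duplication explicit, here is a minimal base R sketch that splits the four barcodes above into fields. It assumes the standard TCGA barcode layout, where the third field is the participant and the first two characters of the fourth field are the sample type code. The names `fields`, `barcode_info` and `dup_participants` are only illustrative.

```r
barcodes <- c("TCGA-HZ-A9TJ-01A-11R-A41I-07",
              "TCGA-HZ-A9TJ-06A-11R-A41B-07",
              "TCGA-H6-A45N-01A-11R-A26U-07",
              "TCGA-H6-A45N-11A-12R-A26U-07")

# Split each barcode on "-" into its seven fields (one row per barcode)
fields <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

barcode_info <- data.frame(
  barcode     = barcodes,
  participant = fields[, 3],                  # third field
  sample_type = substr(fields[, 4], 1, 2),    # first two characters of fourth field
  stringsAsFactors = FALSE
)

# Participants that occur more than once
dup_participants <- unique(barcode_info$participant[duplicated(barcode_info$participant)])

print(barcode_info)
print(dup_participants)
```

In the TCGA sample type codes, 01 is primary solid tumor, 06 is metastatic and 11 is solid tissue normal, so each pair above is two different tissue samples from the same participant rather than a true technical duplicate.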
The R code I used to get the data is below:
library(TCGAbiolinks)
library(dplyr)
library(DT)
library(SummarizedExperiment)
# 1
query1 <- GDCquery(project = "TCGA-PAAD",
                   data.category = "Transcriptome Profiling",
                   data.type = "Gene Expression Quantification",
                   workflow.type = "HTSeq - Counts")
# alternative: workflow.type = "HTSeq - FPKM-UQ"
GDCdownload(query1)  # the files must be downloaded before GDCprepare() can read them
df <- GDCprepare(query1,
                 save = TRUE,
                 save.filename = "TCGA-PAAD_dataframe.rda",
                 summarizedExperiment = FALSE)
write.csv(df, file = "count.csv")
# 2
query <- GDCquery(project = "TCGA-PAAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts")
# Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
GDCdownload(query)
# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
PAADRnaseqSE <- GDCprepare(query)
PAADMatrix <- assay(PAADRnaseqSE, "HTSeq - Counts") # or PAADMatrix <- assay(PAADRnaseqSE, "raw_count")
# For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
PAADRnaseq_CorOutliers <- TCGAanalyze_Preprocessing(PAADRnaseqSE)
# quantile filter of genes
dataFilt <- TCGAanalyze_Filtering(tabDF = PAADRnaseq_CorOutliers,
                                  method = "quantile",
                                  qnt.cut = 0.25)
# selection of normal samples "NT"
samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("NT"))
# selection of tumor samples "TP"
samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                   typesample = c("TP"))
# Diff.expr.analysis (DEA)
dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[, samplesNT],
                            mat2 = dataFilt[, samplesTP],
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01,
                            logFC.cut = 1,
                            method = "glmLRT")
# DEGs table with expression values in normal and tumor samples
dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs, "Tumor", "Normal",
                                          dataFilt[, samplesTP], dataFilt[, samplesNT])
write.csv(dataDEGsFiltLevel, file = "DEGs.csv")
#########################################
Thanks for answering. Because I want to integrate my data with the protein data, I have to use only part of the TCGA barcode: the third field, which identifies the participant. For example, in TCGA-02-0001-01C-01D-0182-01 the participant is 0001, and that is the part I need.
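One way this could be done is sketched below in base R. It is not a definitive recipe: `get_participant` is a hypothetical helper, `expr` is a toy matrix with made-up values standing in for the real expression matrix, and keeping only primary tumor samples (sample type code 01) is just one assumed rule for ending up with a single column per participant.

```r
# Hypothetical helper: return the third "-"-separated field of each barcode
get_participant <- function(barcodes) {
  vapply(strsplit(barcodes, "-", fixed = TRUE), function(x) x[3], character(1))
}

get_participant("TCGA-02-0001-01C-01D-0182-01")

# Toy expression matrix (made-up values) with full barcodes as column names
expr <- matrix(1:8, nrow = 2,
               dimnames = list(c("geneA", "geneB"),
                               c("TCGA-HZ-A9TJ-01A-11R-A41I-07",
                                 "TCGA-HZ-A9TJ-06A-11R-A41B-07",
                                 "TCGA-H6-A45N-01A-11R-A26U-07",
                                 "TCGA-H6-A45N-11A-12R-A26U-07")))

# Assumed rule: keep primary tumor samples only
# (the sample type code sits at characters 14-15 of the barcode)
is_primary <- substr(colnames(expr), 14, 15) == "01"
expr_tp <- expr[, is_primary, drop = FALSE]

# Rename the columns to participant IDs and drop any remaining repeats
colnames(expr_tp) <- get_participant(colnames(expr_tp))
expr_tp <- expr_tp[, !duplicated(colnames(expr_tp)), drop = FALSE]

print(expr_tp)
```

If the protein table turns out to be keyed on a longer prefix (for example the first 12 characters, such as TCGA-HZ-A9TJ), `substr(barcode, 1, 12)` could be used in place of the helper. Which key the TCPA file actually uses is an assumption to check against the downloaded data.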
Specifically what data are you working on? Where do you get the data from? Could you post an example of a duplicated sample id?