This is a frequently asked question, so here is a robust method to fully recapitulate the counts given by TCGA and analyse them using DESeq2
.
! Updated to reflect Data Release 32.0
GDC Product: Data - GENCODE v36 Release
Release Date: March 29, 2022
Why the long way? Tanya and I noticed via TCGA-Biolinks
and Firehose
did not generate the full count matrix. ~5-10% of genes were missing to begin with. (more on this at the end of the post)
GDC Portal
1. Add files to cart
First, you will need to go to the GDC Portal and add all of the files you want to download to your cart:
- In the left panel, select the dataset you want. This example is for
TCGA-PRAD
gene expression quantification usingSTAR
. - Add all files to cart
2. Download Manifest, Sample Sheet
GDC Client
1. Install gdc-client
tool:
wget https://gdc.cancer.gov/files/public/file/gdc-client_v1.6.1_Ubuntu_x64.zip
unzip gdc-client_v1.6.1_Ubuntu_x64.zip
sudo mv gdc-client /usr/bin && rm gdc-client_v1.6.1_Ubuntu_x64.zip
gdc-client --help
Please check if newer versions are available as the post ages: GDC Data Transfer Tool Client.
2. Download files using the client:
mkdir TCGA-PRAD
gdc-client download -m gdc_manifest_20220427_075342.txt -d TCGA-PRAD/
You will now have a directory for each patient under TCGA-PRAD/
:
$ tree TCGA-PRAD/ | head
TCGA-PRAD/
├── 0007888f-8d96-4c01-8251-7fef6cc71596
│ ├── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv
│ └── logs
│ └── 88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv.parcel
├── 00d6b6a9-179f-4b2a-826f-f8326197b17d
│ ├── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv
│ └── logs
│ └── 9c56eacc-c71b-48b5-8747-008fd1105f89.rna_seq.augmented_star_gene_counts.tsv.parcel
├── 012f91be-31a2-4e38-9922-2ee6f9b3328c
Generate count matrix in R
Note: This step is memory intensive
Please uncomment the lines below if you wish to select only protein_coding
genes (recommended).
library(data.table)
generate_count_mat <- function(path, pattern) {
files = list.files(path, pattern, full.names = TRUE, recursive=TRUE, include.dirs=TRUE)
mat = as.data.frame(do.call(cbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE))))
mat <- mat[-c(1:4),]
#gene_type <- as.character(mat[,3])
rownames(mat) = mat[,1]
mat = as.data.frame(mat[, seq(4, ncol(mat), 9)])
#mat$gene_type <- gene_type
#mat <- subset(mat, mat$gene_type == "protein_coding")
#mat <- mat[,-c(ncol(mat))]
return(mat)
}
# stage raw counts
mrna_counts <- generate_count_mat("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$")
Uncommenting the lines above results in a count matrix with 19962 protein coding genes. Without filtering, the resulting count matrix has 60660 genes, composed of:
14 IG_C_gene
9 IG_C_pseudogene
37 IG_D_gene
18 IG_J_gene
3 IG_J_pseudogene
1 IG_pseudogene
145 IG_V_gene
187 IG_V_pseudogene
16901 lncRNA
1881 miRNA
2212 misc_RNA
2 Mt_rRNA
22 Mt_tRNA
48 polymorphic_pseudogene
10167 processed_pseudogene
19962 protein_coding
18 pseudogene
8 ribozyme
47 rRNA
497 rRNA_pseudogene
49 scaRNA
1 scRNA
943 snoRNA
1901 snRNA
5 sRNA
1057 TEC
500 transcribed_processed_pseudogene
138 transcribed_unitary_pseudogene
939 transcribed_unprocessed_pseudogene
2 translated_processed_pseudogene
1 translated_unprocessed_pseudogene
6 TR_C_gene
4 TR_D_gene
79 TR_J_gene
4 TR_J_pseudogene
106 TR_V_gene
33 TR_V_pseudogene
98 unitary_pseudogene
2614 unprocessed_pseudogene
1 vault_RNA
Fetch Metadata
You might have noticed the count matrix has non-sensical column names. The count matrix was generated based on list.files()
thus, we must re-arrange the sample sheet to match the same ordering prior to appending Sample.ID
as column names.
This step utilises the Sample Sheet Downloaded in the
GDC Portal
step.
library(tidyverse)
file_names <- list.files("/data/TCGA-PRAD", "\\.rna_seq.augmented_star_gene_counts.tsv$", full.names = FALSE, recursive = TRUE, include.dirs = FALSE)
file_names <- sub(".*/", "", file_names)
samplesheet <- read.table("gdc_sample_sheet.2022-04-27.tsv", header=T, sep="\t")
samplesheet <- samplesheet[match(file_names, samplesheet$File.Name),]
colnames(mrna_counts) <- samplesheet$Sample.ID
meta <- subset(samplesheet, select=c(Sample.ID, Sample.Type))
rownames(meta) <- NULL
meta <- column_to_rownames(mrna_meta, var="Sample.ID")
If you want to compare Tumor vs. Normal
, filter to remove samples that are neither:
meta$Sample.Type <- gsub("Primary Tumor", "Tumor", meta$Sample.Type,)
meta$Sample.Type <- gsub("Solid Tissue Normal", "Normal", meta$Sample.Type)
meta$Sample.Type <- as.factor(meta$Sample.Type)
levels(meta$Sample.Type)
> levels(meta$Sample.Type)
[1] "Metastatic" "Normal" "Tumor"
Remove the Metastatic
sample from both the counts matrix and metadata:
metastatic_key <- rownames(meta)[which(meta$Sample.Type == "Metastatic")]
mrna_counts <- mrna_counts[names(mrna_counts)[!names(mrna_counts) %in% metastatic_key]]
meta <- subset(meta, !(rownames(meta) %in% metastatic_key))
table(colnames(mrna_counts) == rownames(meta))
> table(colnames(mrna_counts) == rownames(meta))
TRUE
550
Note: If you want to perform more complex contrasts, download the
clinical sample sheet
in the GDC cart checkout page.
DESeq2
dds <- DESeqDataSetFromMatrix(mrna_counts, colData = meta, design = ~ Sample.Type)
dds$Sample.Type <- relevel(dds$Sample.Type, ref = "Normal")
dds <- DESeq(dds)
## DESeq2 results
res <- results(dds, alpha = 0.05, name = "Sample.Type_Tumor_vs_Normal")
res_df <- as.data.frame(res)
## Function to split results
get_upregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange>=1)],
rownames(df)[which(df$pvalue<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
return(results)
}
get_downregulated <- function(df){
key <- intersect(rownames(df)[which(df$log2FoldChange<=-1)],
rownames(df)[which(df$pvalue<=0.05)])
results <- as.data.frame((df)[which(rownames(df) %in% key),])
return(results)
}
up_reg <- get_upregulated(res_df)
down_reg <- get_downregulated(res_df)
Appending Gene Names to results
The STAR
counts in release 32 come with a gene_name
column we can use to annotate our results (or use BiomaRt
..). Read one of the raw files in and use this as an ensembl_gene_id_version
to gene_name
map.
$ head TCGA-PRAD/0007888f-8d96-4c01-8251-7fef6cc71596/88215dd0-5841-44f1-9393-eefd8238cbb3.rna_seq.augmented_star_gene_counts.tsv
# gene-model: GENCODE v36
gene_id gene_name gene_type unstranded stranded_first stranded_second tpm_unstranded fpkm_unstranded fpkm_uq_unstranded
N_unmapped 1119332 1119332 1119332
N_multimapping 4161493 4161493 4161493
N_noFeature 3363885 28590771 28547217
N_ambiguous 5128483 1241035 1244999
ENSG00000000003.15 TSPAN6 protein_coding 3473 1768 1705 54.5637 16.3336 16.0655
ENSG00000000005.6 TNMD protein_coding 2 2 0 0.0966 0.0289 0.0284
ENSG00000000419.13 DPM1 protein_coding 1591 827 764 93.9366 28.1198 27.6583
ENSG00000000457.14 SCYL3 protein_coding 1263 938 957 13.0767 3.9145 3.8502
Discrepancies
GDC Portal Method:
> dim(mrna_counts)
[1] 60660 553
GDC Query Method:
library(TCGAbiolinks)
library(SummarizedExperiment)
library(biomaRt)
mrna_query <- GDCquery(project = "TCGA-PRAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
experimental.strategy = "RNA-Seq")
GDCdownload(mrna_query, method = "api", files.per.chunk = 100,
directory = "~/Desktop/TCGA/mRNA/")
mrna_df <- GDCprepare(mrna_query, directory = "~/Desktop/TCGA/mRNA/")
# extract what you need from mrna_df$ to add to metadata
mrna_meta <- mrna_df$sample
mrna_meta <- cbind(mrna_meta, mrna_df$definition)
mrna_df <- assay(mrna_df)
dim(mrna_df)
> dim(mrna_df)
[1] 56602 551
You are instantly missing 6.5% of genes compared to the gdc-client
method - apparently firehose
performs worse.
For the sake of reproducibility, one should be able to obtain the full TCGA gene set and decide for themselves what to discard.
Full credit to Tanya Aneichyk for raising these discrepancies in the first place, and to Dónal O'Shea for pointing out the need for preserving ordering using file.names()
before appending Sample.ID
.
When I performed using TCGABiolinks I'm getting the exact count. Can you cross-check with updated pipeline?
Seems like the discrepancy is fixed
.htseq.counts
are not available in the updated gdc-portal website. They provide onlySTAR - Counts
. I think this workflow needs change as perSTAR - Counts
as the file formats are different.Thanks for the heads up.
Seems to be a minor tweak - I just have to change the column indexing to grab the
unstranded
column from the STAR file. I will update the documentation to reflect theaugmented_star_gene_counts.tsv
nomenclature. Thanks again.Hey this is exactly what I was looking for since the genes I am interested in seem to be out of all the online TCGA seq datasets. However, I don't think my computer can handle this downloading all of this (I'm a noobie too). Would it be possible to host the results online somewhere or transfer them to me?
I haven't tested this code, but there is a corner case in TCGA/CTPAC-3 or any other projects that have RPPA protein array data. As you might know, all libraries are constructed in the aliquot-level, and RPPA just needs a lot more materials so the analysis center actually combined a few samples to a single aliquot, instead of the regular way (split one sample into multiple aliquot). As a consequence of that, you might see sometimes the sample_id, sample_type, etc are comma delimited value array instead a simple enum.
I didn't know about this. If you can point to a specific example I can add an addendum as an additional comment/answer. These types of tutorials benefit from inputs from the community, thanks for mentioning this.
If you just select all CPTAC-3 RNA-Seq Genomic-aligned BAMs, add them to the GDC cart and download the "sample sheet", you will see ~200 BAMs generated from combined aliquots. One example is 8912abbb-e37b-4bee-8244-55e30f2fc4bc generated from an aliquot that is a combination of C3N-00498-04 and C3N-00498-02, both are "Primary Tumor" sample_type and belong to the same case C3N-00498