Question

How to deal with the abnormal samples from TCGA found in a PCA?

0

Entering edit mode

5.4 years ago

Yun ▴ 230

I have downloaded a dataset with 551 samples through TCGAbiolinks in R, and I prepared to do diffrentially expressed analysis, howerver, when I did data data quality assessment by sample distance clustering and PCA, there seemed some samples are abnormal, what should I do to them? Here is the picture. pca with data variance stabilized by DESeq2 function "vst"

heatmap of sample distance

RNA-Seq R rna-seq next-gen • 2.9k views

ADD COMMENT • link 5.4 years ago by Yun ▴ 230

Kevin Blighe · Answer 1 · 2019-07-21

1

Entering edit mode

5.4 years ago

Kevin Blighe 88k

There is no issue, depending on your perspective. The NT samples are from normal tissue. The TP samples are from tumour tissue. It is well documented that biopsies taken from 'normal' / stromal tissue surrounding the primary tumour can resemble the tumour itself. This is where we enter the debate of when cancer begins. A geneticist may say that it begins in the very first cell that accumulates a point mutation in, e.g., an oncogene, but this still may occur many years before conventional diagnostic / screening methods can detect a palpable tumour.

The other possibility is that there is a sample mix-up, either in your code, the inner code of TCGAbiolinks, and / or at the TCGA source.

Kevin

ADD COMMENT • link 5.4 years ago by Kevin Blighe 88k

1

Entering edit mode

Hi Kevin,

I wanted to add on a quick side question to this thread (I don't think it warrants a new post, still discussing sample variation present in TCGA-PRAD DE analysis)

When creating the design for DESeq2, the simplest way to do it is design = ~ Condition where condition has tumor/normal information. I was under the impression it is essential to account for patient variation in the dataset to maximise the Tumor vs. Normal comparison, thus making a metadata file like so:

Patient Condition
1   Tumor
2   Tumor
3   Normal
4   Tumor
5   Normal
...

And using design = ~ Patient + Condition for DESeq. Have I got the right idea here or is the simpler design preferable.

Best,

Barry

ADD REPLY • link 4.4 years ago by Barry Digby ★ 1.3k

1

Entering edit mode

Yes, that is indeed a good recommendation.

ADD REPLY • link 4.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks for your quickly reply , dose it mean that normally some normal samples should surrounded by tumour samples in PCA and their distance may not cluster in a heatmap? and I'll check if there are some same samples to confirm no mix-up samples from TCGA. here is my code. Thanks you to remove my doubts, I have been Googling this problem for one day.

library(TCGAbiolinks)
library(DESeq2)
library(pheatmap)
counts_query <- GDCquery(project = "TCGA-PRAD",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts")
GDCdownload(counts_query, files.per.chunk = 8)
counts_se <- GDCprepare(counts_query, save = TRUE,
                        save.filename = "counts_prostate.rda", 
                        summarizedExperiment = TRUE)
counts_se$shortLetterCode <- factor(counts_se$shortLetterCode) 
counts_dds <- DESeqDataSet(counts_se, design = ~shortLetterCode)
counts_dds$shortLetterCode
summary(counts_dds$shortLetterCode)
counts_dds <- counts_dds[,counts_dds$shortLetterCode != "TM"]
counts_dds$shortLetterCode <- droplevels(counts_dds$shortLetterCode)
counts_dds$shortLetterCode <- relevel(counts_dds$shortLetterCode, ref = "NT")
nrow(counts_dds)
counts_dds <- counts_dds[rowSums(counts(counts_dds)) > 1,]
nrow(counts_dds)
vst_counts <- vst(counts_dds, blind = F)
CairoPNG(filename = "vst_meansdplot.png")
meanSdPlot(assay(vst_counts), ranks = T)
dev.off()
vst_dist <- dist(t(assay(vst_counts)))
dist_matrix <- as.matrix(vst_dist)
annotation <- data.frame(vst_counts$shortLetterCode)
colnames(dist_matrix) <- rownames(annotation)
rownames(dist_matrix) <- rownames(annotation)
CairoSVG(file = "sample_distance_heatmap.svg")
pheatmap(dist_matrix, 
         clustering_distance_cols = vst_dist,
         clustering_distance_rows = vst_dist,
         show_colnames = F,
         show_rownames = F,
         annotation_col = annotation,
         annotation_row = annotation,
         annotation_names_col = F,
         annotation_names_row = F,
         main = "sample-distance")
CairoPNG(filename = "vst_pca.png")
DESeq2::plotPCA(vst_counts, intgroup = "shortLetterCode")
dev.off()

ADD REPLY • link 5.4 years ago by Yun ▴ 230

1

Entering edit mode

It makes sense to me. For one, PCA is looking at co-variation across your samples. Your heatmap, on the other hand, is based on a distance matrix that is calculated by Euclidean distances between your samples.

Looking again at your PCA bi-plot, the percent (%) variation explained by PC1 is relatively low, which indicates that there is no great separation across your samples. This is then confirmed in the heatmap.

I am not too familiar with the PRAD TCGA dataset, though. Is a lack of heterogeneity expected for Prostate Adenocarcinoma?

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Yes, and I don't know how to deal with these samples.

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

What is the aim of your study?

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

To find the differentially expressed gene in prostate cancer, following by GO , KEGG analysis.Sorry for reply so late, the site biostar restrict the number of the newer post.

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

I see, well, you can conduct the analysis and then see what you find. Judging by the PCA bi-plot, the list of differentially expressed genes will not be large.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

I did it, the code and results are below:

counts_de <- DESeq(counts_dds)
CairoPNG(filename = "de_despest.png")
plotDispEsts(counts_de)
dev.off()
counts_res <- results(counts_de, 
                      contrast = c("shortLetterCode", "TP", "NT"), 
                      alpha = 0.05)
summary(counts_res)

summary(counts_res)

out of 54091 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 12299, 23%
LFC < 0 (down)     : 8661, 16%
outliers [1]       : 0, 0%
low counts [2]     : 15731, 29%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Can the abnormal samples be a problem for my workflow? whether is it necessary for me to omit them? I have read some tutorial, they all don't refer to the solusion of the samples with a lack of heterogeneity.

ADD REPLY • link updated 5.4 years ago by Kevin Blighe 88k • written 5.4 years ago by Yun ▴ 230

0

Entering edit mode

From what I can see, for RNA-seq studies in cancer, unless you have a true normal sample group, then perhaps it is better to just look at the tumour samples. You can perform differential expression comparisons between the tumour sub-type, for example (I am not sure of the histopathological sub-types of prostate cancer); you can also retain highly expressed / not expressed genes based on Z-score cut-offs (this is how cBioPortal does it).

Basically, the TCGA never performed RNA-seq on a large number of normal samples, and the 'normals' that they used are, as you can see, mostly not normal at all. Some of the cancers do have true normal cells derived from, I believe, blood lymphocytes, but these are in a severe minority.

So, comparisons in terms of dififerntial expression between tumour and normal are not suitable for different reasons.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

I'm a bit confused about the control samples selection between a true normal sample with tissue from a no-cancer individual and a normal tissue from a cancer people for RNA-seq, I can understand it is a better choice to chose a true normal sample from what you said, but in my opinion, it may be good choice to select a control sample with tissue from a cancer, because I think a matched group need control some condition which aren't experiment design. Can you tell me some reasons for the choice of a true normal sample.

I have just now read a post(TCGA: Does TCGA cancer studies have mRNA expression data for Control/Normal samples? ) where I can find there are both true normal samples(from people without cancer) and a normal sample from normal tissue(individuals have cancer) in TCGA with a different barcode(Tumor types range from 01 - 09, normal types from 10 - 19 and control samples from 20 - 29 So 0 are tumors and 1 are normals e.g, 01 are primary tumours). So it can be a method if I can filter these samples, can it?

I have also find a paper(https://www.biorxiv.org/content/10.1101/336909v1.full) where it refers to a database GTEx to be a reference normal tissue sample, can it be a choice?

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

Well, yes, you can obtain the true 'blood' normals, if you wish; however, due to the way that you are running it, i.e., with TCGAbiolinks, you are restricting yourself to the data that TCGAbiolinks uses. I always obtain the data myself from the GDC Data Portal (or the GDC Legacy Archive), although, doing this requires some patience to first learn how to obtain the data.

For further information on sample types and their codes, Ill direct you to 2 resources:

Regarding GTEx, I see no reason why you cannot use it for examining 'normal' expression in your tissue of interest.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks for you to tell me these, I am still confused, I still don’t know the necessity of choosing the true normal samples, Could you tell me some reasons for it?

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

Well, would it make sense to conduct a differential expression comparison between, for example, 200 tumour samples and just 3 [true] normal samples? If you filter for matched tumour-normal pairs (normal of any type), then you may have just a 2 vs 2 comparison. The dataset with the largest tumour-normal pairs is BRCA (breast cancer), with ~112 tumour-normal pairs, if I recall. For PRAD, I think that the number is either 0 or just 2. You can match these based on the TCGA barcode.

I do not know the exact task that you have been given, but I imagine that the direction is coming from your supervisor (?) - what, exactly, have you been asked to do? As an analyst, you are also capable of making decisions.

In a nutshell, the use of normal samples in TCGA RNA-sequencing was never a priority of the TCGA consortium. Normal samples for DNA-based analysis was [the priority]. In certain analyses in the published literature for TCGA data, normal RNA-seq samples were never even used. Instead, cluster-based approaches using just the tumour samples were used.

ADD REPLY • link 5.4 years ago by Kevin Blighe 88k

1

Entering edit mode

Thanks very much, I read what you have said again, and I think I have understood what you explain. Thank you for removing my confussion.

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

Thanks for your patient explanation, can I have this understanding that it's the number of the normal samples in TCGA database that mainly restrict the use to differential expression gene in TCGA and the quality of the normal samples? is it right?

I'm a beginner in bioinformatics, and I'm also a student in medical. I am just intersted in prostate cancer, and so I keep it my studing pathway in learning bioinformatics that find the hub gene in prostate cancer. Maybe your advice of the tumour subtype will also be a better method for me.

ADD REPLY • link 5.4 years ago by Yun ▴ 230

0

Entering edit mode

I also check the batch effects in https://bioinformatics.mdanderson.org/BatchEffectsViewer/ , It seems there is no problem.

ADD REPLY • link 5.4 years ago by Yun ▴ 230