Hi,
I have about a dozen RNA-seq samples from a human tissue (DRG), which during RNA extraction, were enriched for one of the cell types found in that tissue (neurons). In other words, all samples should ideally be from neurons. But I have the suspicion that some of the samples are bad, in that they either have heavy contamination from other cell types OR the library preparation was somehow compromised.
To find the samples that are indeed coming from neurons (irrespective of the phenotypic difference that I am studying, which in this case is pain), I am thinking of doing clustering of all the samples based on their coding transcriptome i.e. the TPM values of all the coding genes in each sample. I hope to be able to find a big cluster that would represent samples that are actually coming from neurons, and then some samples to be outliers indicating undesirable samples.
Is there a known tool/package that could help me do this?
NOTE - I don't want to use the phenotype information (like WT/experimental) which DESeq2 uses for its analysis. I just want to cluster them based on their overall individual gene expression
My whole code is here -
counts <- featureCounts(nthreads=3, isGTFAnnotationFile=TRUE, annot.ext="/Volumes/bam/DRG/annotations/Homo_sapiens.GRCh38.95.gtf", files=c('40T4L.fastqAligned.sortedByCoord.out.bam', '41T7R.fastqAligned.sortedByCoord.out.bam', '42T7L.fastqAligned.sortedByCoord.out.bam', '42T7R.fastqAligned.sortedByCoord.out.bam', '44T10R.fastqAligned.sortedByCoord.out.bam', '44T11L.fastqAligned.sortedByCoord.out.bam', '44T11R.fastqAligned.sortedByCoord.out.bam', '45T10L.fastqAligned.sortedByCoord.out.bam', '45T11R.fastqAligned.sortedByCoord.out.bam', '46T8L.fastqAligned.sortedByCoord.out.bam', '46T8R.fastqAligned.sortedByCoord.out.bam', '47T7L.fastqAligned.sortedByCoord.out.bam', '47T7R.fastqAligned.sortedByCoord.out.bam'))$counts
sampleTable <- data.frame(condition = factor(c('P', 'P', 'P', 'P', 'NP', 'NP', 'NP', 'NP', 'P', 'P', 'P', 'P', 'P')))
coldata <- sampleTable
deseqdata <- DESeqDataSetFromMatrix(countData=counts, colData=coldata, design=~condition)
dds <- DESeq(deseqdata)
vsd <- vst(dds)
plotPCA(vsd, ntop=1000) + geom_text(aes(label=name),vjust=2,check_overlap = TRUE,size = 4)
Thanks a lot! I am trying to implement
plotPCA
, and I have used the code here to load my BAMs to FeatureCounts and transfer that to DESeq2. Now, to use plotPCA, should I run DESeq2 on the samples first? Because the link you shared seems to applyplotPCA
to data on whichvst
has been applied, and that they do after running DESeq ondds
. I ask this because I was not really planning to do any DE analysis right now, just wanted to see how the samples are on the transcriptomic level (and if there are outliers)OK, so I ran DESeq on the samples and then used plotPCA. My question, however, remains that is there a way that can cluster the samples just based on the individual gene expression values, not relying on the phenotype. Just changing the condition vector for DESeq2 from the phenotype classes to the filenames didn't work (it gave the error
the design matrix has the same number of samples and coefficients to fit so estimation of dispersion is not possible.
). So, is there a way to cluster the samples based on their individual expression levels, and not based on their phenotype?clustering in plotPCA is already based on normalized expression values of the
ntop
genes with the highest variance. Phenotype information are only used to label your samplesPCA is based on the gene expression, so is the heatmap I linked. The phenotype is simply for coloring the data points. Please show all code you used if you want to debug this error.
Thanks! I have added my whole code to the question now
You can skip
DESeq()
it is not necessary prior to runningvst
.plotPCA(vsd, ntop=1000, "condition")
should do it plus any optional ggplot parameters you like.Thanks I tried it without running DESeq() and was able to get the results. Interestingly, as you said, the results are exactly the same!