Hello everyone,
I am doing bulk RNA-sequencing analysis on human brain samples from ~160 donors. I am mostly following the workflow described here with edgeR/limma. One of the first steps is removing lowly expressed genes, applying TMM normalisation, and plotting the log-CPM distribution. As you can see, in some samples many of the retained genes have zero or near-zero expression:
The next step was outlier detection, for which I used both hierarchical clustering and PCA to inspect the dataset visually. About 10 samples cluster away from the rest of the dataset in both the hierarchical clustering and the PCA (where the first PC is mostly driven by RIN). The samples labelled in the PCA plot are those that form their own cluster in the correlation heatmap.
Most of the 'outlier' samples (8 out of 11), despite having low RIN and clustering separately from the bulk of the dataset, have smooth log-CPM distributions. Instead of removing these samples altogether, I would rather use voomWithQualityWeights to account for the low RNA quality. The remaining outliers (3 out of 11) are those with really different log-CPM distributions (NND_91-IHK, NND_53-YFQ, NND_28-XPU). They still cluster away from most samples in the heatmap/PCA, but this is driven by poor sequencing quality rather than RIN. Indeed, their sequencing yield (in Mb) is very low compared to the rest of the dataset.
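To make the distinction between the two groups of outliers concrete, here is a minimal sketch of two per-sample metrics that could separate "low RIN but smooth distribution" from "very low yield": the fraction of retained genes with a zero count, and a robust z-score of the log10 library size. It is written in plain Python purely for illustration and is not part of edgeR or limma. The function name `sample_qc`, the sample IDs, the toy counts and the cutoff of -3 are all hypothetical assumptions, not values from my data.

```python
import math
from statistics import median

def sample_qc(counts, z_cutoff=-3.0):
    """Summarise per-sample yield from raw counts of the retained genes.

    counts: dict mapping sample ID -> list of raw counts (same gene order).
    z_cutoff: hypothetical threshold on the robust z-score of log10 library size.
    """
    # log10 library size per sample (guard against an all-zero sample)
    log_lib = {s: math.log10(max(sum(v), 1)) for s, v in counts.items()}
    centre = median(log_lib.values())
    # Median absolute deviation, scaled to be comparable to a standard deviation
    mad = 1.4826 * median(abs(x - centre) for x in log_lib.values())

    report = {}
    for s, v in counts.items():
        z = (log_lib[s] - centre) / mad if mad > 0 else 0.0
        report[s] = {
            "library_size": sum(v),
            "zero_fraction": sum(c == 0 for c in v) / len(v),
            "robust_z": z,
            "low_yield": z < z_cutoff,
        }
    return report

# Toy data (made up): five ordinary samples and one with very low yield
toy_counts = {
    "S1": [500, 1200, 300, 800, 100],
    "S2": [520, 1100, 310, 790, 120],
    "S3": [480, 1300, 280, 820, 90],
    "S4": [510, 1250, 305, 760, 110],
    "S5": [495, 1180, 290, 810, 105],
    "LOW_YIELD": [0, 3, 0, 1, 0],
}

report = sample_qc(toy_counts)
for sample, row in report.items():
    print(sample, row["library_size"], row["zero_fraction"], row["low_yield"])
```

A sample flagged on both metrics would correspond to the second group above (poor sequencing yield), whereas a low-RIN sample with a normal library size and few zero counts would not be flagged. The cutoff would need tuning on real data.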
Very long premise for a very short question: should I still include these samples in the analysis with voomWithQualityWeights or should I remove them because the composition of their transcriptome is too different due to technical reasons?
Thanks in advance for your help!