Should primary tumor RNA-Seq match their derived cell lines and PDXs? And which dimensionality reduction method should I use?
17 months ago
sarahgzb ▴ 40

I have processed a bunch of RNA-Seq data coming from patient primary tumours and their respective established cell lines and PDXs. My question is, should the cell lines/PDXs cluster together with their original primary tumour?

Also, which dimensionality reduction technique would be best suited to visualize this (PCA, tSNE, UMAP)?

Lastly, how should the raw counts be normalized to get the best clustering?

PDX dimensionality-reduction RNA-Seq PCA

My question is, should the cell lines/PDXs cluster together with their original primary tumour?

In my experience, mostly but not always. If you gave me a bunch of patients and their PDXs unlabeled on a PCA plot, I would not be able to pick out the pairs. It would definitely be odd to see a large PC1 separation between a patient and their PDX model, though, unless something changed dramatically over the transplant generations (which should not usually happen).
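
Something along these lines is one way to eyeball it (a minimal sketch, assuming a genes x samples DataFrame `counts` and a hypothetical `meta` table with `patient` and `sample_type` columns; the type labels are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

log_expr = np.log2(counts + 1)                 # simple variance-stabilizing transform
pca = PCA(n_components=2)
coords = pca.fit_transform(log_expr.T.values)  # rows = samples

# plot samples with one marker per sample type, annotated by patient of origin
markers = {"primary": "o", "cell_line": "s", "PDX": "^"}
for sample, (x, y) in zip(log_expr.columns, coords):
    plt.scatter(x, y, marker=markers[meta.loc[sample, "sample_type"]])
    plt.annotate(meta.loc[sample, "patient"], (x, y), fontsize=7)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%})")
plt.show()
```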

Lastly, how should the raw counts be normalized to get the best clustering?

I've been struggling with this myself. The UQpgQ2 method described here is useful, and at an even more basic level, rank-based correlation (Spearman) helps.
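
A rough illustration of the rank-based idea (again assuming a genes x samples `counts` DataFrame; Spearman is invariant to any monotone normalization, which is part of its appeal here):

```python
import seaborn as sns

# pairwise Spearman correlation between samples (columns)
sample_cor = counts.corr(method="spearman")

# patient/cell-line/PDX trios should show up as blocks of high
# off-diagonal correlation once the matrix is clustered
sns.clustermap(sample_cor)
```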

17 months ago
LauferVA 4.5k

sarahgzb, I'll throw in a few thoughts alongside the others.

should the cell lines/PDXs cluster together with their original primary tumour?

I wouldn't think about this in terms of should or should not. I would think about what it means if they do or do not. If a given PDX is, on average, no more similar to its own primary tumor than to any other patient's, I think the appropriate conclusion is that the PDXs lose substantial representativeness of the tumors as they exist in the patients themselves. This in turn could drive differences in response to the things you care about ...

Having said that, coding for shared patient origin as well as primary tumor vs PDX, and controlling for both, may enable you to partial out variance that arises while the tumor is xenografted (i.e., variance shared by most/all of the PDXs), helping within-patient relationships to pop out.
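
As a very rough sketch of that idea, you could regress out the shared primary-vs-PDX shift per gene and look at the residuals (assuming `log_expr` is genes x samples and `meta["sample_type"]` is a hypothetical column; a real analysis would encode this in a proper design matrix, e.g. in limma):

```python
import numpy as np
import pandas as pd

# indicator for xenografted samples, aligned to the expression columns
is_pdx = (meta.loc[log_expr.columns, "sample_type"] == "PDX").astype(float).values
X = np.column_stack([np.ones_like(is_pdx), is_pdx])   # intercept + PDX indicator

# ordinary least squares for all genes at once; the residuals have the
# shared sample-type shift removed, so within-patient structure can pop out
Y = log_expr.values.T                                  # samples x genes
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
adjusted = pd.DataFrame((Y - X @ beta).T,
                        index=log_expr.index, columns=log_expr.columns)
```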

Also, which dimensionality reduction technique would be best suited to visualize this (PCA, tSNE, UMAP)?

They tell you different things. tSNE and UMAP will give a better single-picture representation of overall nearness/distance between like samples, one that summarizes many potential relationships, because these algorithms are designed to map high-dimensional datasets onto just a 2D plane. However, that sometimes comes at the cost of deforming part of the map (just as Alaska and Russia appear to be far apart on a Mercator projection). By contrast, standard PC plots plot 2 PCs against each other. If you find that your PCs correlate with one or more covariates in your model, you might want to be looking at PC plots - heck, you may even want to consider including a given PC as a covariate in your model ... there's no correct answer a priori; there's only what gives you insight into your data.
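
To make the PC-vs-covariate check concrete, a minimal sketch (assuming `log_expr` as above; `library_size` is just an example covariate name, not something from your data):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
scores = pca.fit_transform(log_expr.T.values)          # samples x 10 PCs

# flag PCs that track a covariate; those are candidates to inspect on a
# PC plot, or to include as covariates in the model
for pc in range(scores.shape[1]):
    rho, p = spearmanr(scores[:, pc],
                       meta.loc[log_expr.columns, "library_size"])
    print(f"PC{pc + 1}: rho={rho:+.2f}, p={p:.2g}")
```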

Lastly, how should the raw counts be normalized to get the best clustering?

I'd reframe this a bit. Instead of asking which mathematical procedure produces the best clustering results (again, a priori), I would normalize the samples with several different normalization methods, then cluster each, then determine which normalization scheme worked best empirically (a posteriori). But how can you do this? Well, if the normalization scheme was good, then it should enable you to reproduce findings about your phenotypes of interest through key gene clusters (pathways, gene modules, gene metamodules, etc.). This puts the observable data and published biology (rather than an abstract argument about which mathematical procedure "should" be best) back in the driver's seat.
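
A bare-bones version of that loop might look like this (the three schemes are illustrative stand-ins, and scoring clusters against patient of origin with the adjusted Rand index is just one example of an external criterion):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def log_cpm(c):           # counts-per-million, log2-transformed
    return np.log2(c / c.sum(axis=0) * 1e6 + 1)

def upper_quartile(c):    # scale each sample by its 75th percentile
    return np.log2(c / np.percentile(c, 75, axis=0) + 1)

def rank_transform(c):    # per-sample gene ranks, the Spearman-style view
    return np.apply_along_axis(rankdata, 0, c)

schemes = {"logCPM": log_cpm, "UQ": upper_quartile, "rank": rank_transform}
patients = meta.loc[counts.columns, "patient"].values

for name, fn in schemes.items():
    X = fn(counts.values).T                            # samples x genes
    labels = KMeans(n_clusters=len(set(patients)),
                    n_init=10, random_state=0).fit_predict(X)
    print(name, adjusted_rand_score(patients, labels))
```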

As an example, consider tumor subtype assignment for GBM. If one method enables more accurate assignment of the correct tumor subtype to each sample, then you can make a biological argument that that scheme performed best. The kicker is if you can then also show that those assignments are in fact more accurate using a different data modality or an orthogonal analytic approach. For instance, if you know that some of the GBM subtypes also have characteristic deletions, and where those deletions are, you could attempt to verify your assignments by seeing which normalization scheme produces assignments that most closely align with known gene clusters of that type AND with the deletion locations (which could be detected through expression data, paired DNA samples, or in another way)...
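
Purely as an illustration of comparing schemes by subtype calls (the marker lists below are placeholders, not real GBM signatures; a real analysis would use a published classifier):

```python
import numpy as np
import pandas as pd

subtype_markers = {                       # hypothetical marker gene sets
    "classical":   ["GENE_A1", "GENE_A2"],
    "mesenchymal": ["GENE_B1", "GENE_B2"],
    "proneural":   ["GENE_C1", "GENE_C2"],
}

def assign(expr):
    # z-score each gene across samples, score each subtype as the mean z
    # of its markers, then call the best-scoring subtype per sample
    z = expr.sub(expr.mean(axis=1), axis=0).div(expr.std(axis=1), axis=0)
    scores = pd.DataFrame({s: z.reindex(g).mean()
                           for s, g in subtype_markers.items()})
    return scores.idxmax(axis=1)

# same helpers as the sketch above, repeated so this block stands alone
log_cpm = lambda c: np.log2(c / c.sum(axis=0) * 1e6 + 1)
upper_quartile = lambda c: np.log2(c / np.percentile(c, 75, axis=0) + 1)

calls_a = assign(pd.DataFrame(log_cpm(counts.values),
                              index=counts.index, columns=counts.columns))
calls_b = assign(pd.DataFrame(upper_quartile(counts.values),
                              index=counts.index, columns=counts.columns))
print((calls_a == calls_b).mean(), "of samples agree between the two schemes")
```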

A nice ancillary point: running multiple normalization methods and learning their effects over time also helps you develop intuition about what might work well for future datasets.


Thank you so so much for your thorough and insightful answer! This helped me out a lot :)


thanks Sarah - comments like this are what keep me coming back. hope it helped. VAL
