Question

Close correlation between differentially-expressed genes


2.2 years ago

psm ▴ 130

Would really appreciate any thoughts/opinions on what might be going on with my dataset - thanks in advance :)

I performed bulk RNA-seq on kidney samples from a cohort of 50 patients with a defined disease, confirmed by pathologist review of sample slides. When I visualized the transcriptomes by tSNE for fun, I found two very distinct clusters, with the samples split 60:40 (similar results on PCA and UMAP). I can't identify any differences between the groups: no major differences in age, sex, the clinical parameters collected, sequencing batch, or anything else I can think of. Weirdly, the genes that are differentially expressed between the two (DESeq2, adjusted p < 0.05, absolute log2FoldChange > 1) are almost all upregulated in one group (let's call it group A), except for a single poorly annotated transcript that is higher in the other group (group B). Moreover, the upregulated genes are all very closely correlated with each other, with a median pairwise Pearson's r of 0.87! Many, if not most, of these genes are not known to be expressed in the kidney, and they are mostly completely absent from group B.
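For anyone wanting to reproduce the summary numbers described above, here is a minimal sketch in Python with NumPy. It is not the code used for the original analysis (that was done with DESeq2 in R). The function name `summarize_de_genes` and its inputs are hypothetical: it assumes you have exported per-gene `log2FoldChange` and `padj` values plus a genes x samples matrix of normalized, log-scale expression. It applies the same thresholds as in the post, counts up- versus down-regulated genes, and computes the median pairwise Pearson correlation among the upregulated genes.

```python
import numpy as np

def summarize_de_genes(log2fc, padj, expr, padj_cut=0.05, lfc_cut=1.0):
    """Summarize direction and co-expression of differentially expressed genes.

    log2fc, padj : per-gene arrays (hypothetical export of DESeq2 results)
    expr         : genes x samples matrix of normalized, log-scale expression
    """
    log2fc = np.asarray(log2fc, dtype=float)
    padj = np.asarray(padj, dtype=float)
    expr = np.asarray(expr, dtype=float)

    # Genes with NaN padj (reported as NA by DESeq2) compare False and are dropped.
    is_de = (padj < padj_cut) & (np.abs(log2fc) > lfc_cut)
    is_up = is_de & (log2fc > 0)
    is_down = is_de & (log2fc < 0)

    median_r = float("nan")
    up_expr = expr[is_up]
    if up_expr.shape[0] >= 2:
        # np.corrcoef treats each row (gene) as a variable.
        corr = np.corrcoef(up_expr)
        # Upper triangle without the diagonal: each gene pair counted once.
        pairs = np.triu_indices_from(corr, k=1)
        median_r = float(np.nanmedian(corr[pairs]))

    return {
        "n_up": int(is_up.sum()),
        "n_down": int(is_down.sum()),
        "median_pairwise_r": median_r,
    }
```

A very high median correlation combined with a one-sided direction of change is what you would expect if a single shared factor drives the whole gene set, which is consistent with the suspicion of one underlying cause rather than many independent effects.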

We checked whether any other tissue had been sampled in the biopsies from that cluster, which could explain it... but there is nothing clear in the pathology reports, and pathway enrichment doesn't show much; the genes don't really follow an organ-specific pattern, with a lot of weird neural/endocrine/gonadal homeobox genes and transcription factors. Again, the pattern of weird genes is very consistent across all group A samples.

My senses are tingling for something technical explaining this, but I can't think of what. The RNA-seq was done on FFPE samples, so maybe this is all just poor-quality RNA... but why would the SAME set of genes be upregulated in poor-quality samples? Has anyone ever seen anything like this, or have any insight into what might be going on?

Cheers!

rna-seq • 637 views


score 0 · Answer 1 · 2022-10-27


2.2 years ago

bompipi95 ▴ 170

Seems like there are unknown sources of variation in the data. Have you tried running the sva package to identify possible surrogate variables, and then adjusting for them in the downstream differential expression step? The sva vignette does this with limma; since you are using DESeq2, the surrogate variables can be added to the design formula instead. If not, you can take a look at sections 3-6 of the sva vignette. Hope this helps!
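To illustrate the idea behind the suggestion above, here is a rough Python/NumPy sketch. It is NOT the sva algorithm (sva is an R/Bioconductor package and uses a more involved iterative procedure). This is only a simplified stand-in: regress out the known covariates, then take the leading singular vectors of the residuals as candidate hidden factors. The function name `estimate_hidden_factors` and its arguments are hypothetical, and it assumes a genes x samples matrix on a log scale.

```python
import numpy as np

def estimate_hidden_factors(expr, known_covariates, n_factors=1):
    """Simplified residual-SVD sketch of hidden-factor estimation (not sva itself).

    expr             : genes x samples matrix, log-scale expression
    known_covariates : samples x k array of known variables (e.g. age, batch)
    Returns per-sample factor scores and the fraction of residual variance
    explained by each returned factor.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[1]
    cov = np.asarray(known_covariates, dtype=float).reshape(n_samples, -1)

    # Design matrix: intercept plus known covariates.
    design = np.column_stack([np.ones(n_samples), cov])

    # Least-squares fit of every gene on the design, then keep the residuals.
    beta, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    resid = expr.T - design @ beta  # samples x genes

    # Leading left singular vectors = dominant unexplained sample-level patterns.
    u, s, _ = np.linalg.svd(resid, full_matrices=False)
    var_explained = s**2 / np.sum(s**2)
    return u[:, :n_factors], var_explained[:n_factors]
```

If the first factor separates the samples the same way the tSNE clusters do, that suggests an unmeasured variable is driving the split. Note that adjusting for such a factor removes the group A versus group B signal from downstream tests, but it does not by itself say what the factor is.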
