Question

Associating small-ish patient clinical data with bulk RNA-seq expression

11 months ago

noodlejackson ▴ 40

Hi everyone,

I have bulk RNA-seq expression in cancer vs non-cancer tissue. I have some clinical info on the patients - like age, sex, location of tumour, cancer grade.

I'd like to find genes that are associated with these variables - for example 'GeneA1 is associated with cancer grade in 50+ females for tumours in the lower part of the organ.'

What techniques would be best?

Also, I'd like to gain more experience with machine learning. Can I use ML for this, even though I don't want to use the resulting model for classification or prediction? I just want to explore the data and gain ML experience.

I'd really appreciate any tips, and I apologise in advance for being naive. To clarify, I'm asking on the forum because, when researching methods myself, I find it hard to know which methods are good or relevant. There seem to be so many options.

Thank you!

statistics ai rna-seq stats machine-learning • 572 views

updated 11 months ago by BioinfGuru ★ 2.1k • written 11 months ago by noodlejackson ▴ 40

score 2 · Accepted Answer · 2024-07-06

Hi,

Cancer genomics is not my area, but maybe this will give you a road map with which to start.

For any comparison of condition 1 vs condition 2, I use DESeq2. If you go through the workflow first and then the manual, you should be able to do what you need. While going through them, pay particular attention to anything related to design, contrasts, and interactions.

Design:

# from the workflow:
ddsMat <- DESeqDataSetFromMatrix(countData = countdata, # counts file
                                 colData = coldata,     # metadata file
                                 design = ~ cell + dex) # "dex" is the condition of interest, "cell" is another variable to adjust for (e.g. sample batches)

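# To begin, I suggest a simple design like this (assuming cancer status is in a column named "condition"):
design = ~ condition + location + grade
# note: if grade is only recorded for tumour samples, it cannot be fitted alongside condition; in that case test grade within the cancer samples only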

# More advanced:
# age and sex affect gene expression regardless of cancer status, so adjust for them by adding them as main effects:
# design = ~ age + sex + location + grade + condition
# to ask whether the cancer effect itself differs by sex (or by age), add an interaction term:
# design = ~ age + sex + location + grade + condition + sex:condition
# alternatively, concatenate age group and sex into one age_sex column and use "+ age_sex" in place of "age + sex" to compare specific groups directly
# (avoid combining age_sex with separate condition:age or condition:sex terms; that model matrix is not full rank and DESeq2 will stop with an error)
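Before fitting anything, it can help to check that a candidate design is actually estimable. The following is a minimal sketch in base R, not part of the DESeq2 workflow itself: the metadata is a made-up, fully balanced toy table, and the column names `age_group`, `sex` and `age_sex` as well as the helper `is_full_rank()` are hypothetical. It shows that main effects plus one interaction are fine, while mixing a concatenated `age_sex` column with a separate `condition:sex` term is not full rank.

```r
# Toy metadata (hypothetical): 16 samples, every combination of
# condition x sex x age_group appears exactly twice.
meta <- data.frame(
  condition = factor(rep(c("healthy", "cancer"), each = 8),
                     levels = c("healthy", "cancer")),
  sex       = factor(rep(c("F", "M"), times = 8)),
  age_group = factor(rep(c("under50", "over50"), each = 2, times = 4),
                     levels = c("under50", "over50"))
)
# Concatenated grouping column, as suggested in the answer
meta$age_sex <- factor(paste(meta$age_group, meta$sex, sep = "_"))

# Hypothetical helper: TRUE if every coefficient of the design is estimable
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data)
  qr(mm)$rank == ncol(mm)
}

# Confounders as main effects, condition of interest last
ok_main  <- is_full_rank(~ age_group + sex + condition, meta)

# Same, plus "does the cancer effect differ by sex?"
ok_inter <- is_full_rank(~ age_group + sex + condition + sex:condition, meta)

# age_sex already contains sex, so adding condition:sex is redundant
bad_mix  <- is_full_rank(~ condition + age_sex + condition:sex, meta)

print(c(ok_main = ok_main, ok_inter = ok_inter, bad_mix = bad_mix))
```

Running the same check on your real `coldata` before calling DESeqDataSetFromMatrix() should flag problem designs early, for example a grade column that only has values for tumour samples.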

Contrast:

Use the contrast argument of results() to compare two levels of any one variable (column) in the metadata. For the more complex design formulas above (especially those with interactions), I use this guide to designs and contrasts in DESeq2.

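dds <- DESeq(ddsMat) # fit the model on the object created above
res <- results(dds, contrast=c("dex","trt","untrt")) # from the workflow: treated vs untreated with dexamethasone
res_basic <- results(dds, contrast=c("condition","cancer","healthy")) # for your data, assuming levels "cancer" and "healthy" in the "condition" column; log2 fold change is cancer relative to healthy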
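For designs with an interaction term, the coefficients are easier to pull out by name. Here is a rough sketch on simulated data from DESeq2's own makeExampleDESeqDataSet(), so none of this is real patient data: the simulated `condition` levels are "A" and "B" (standing in for healthy and cancer), and the `sex` column is added by hand purely for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 100 genes, 12 samples, condition = A (x6) then B (x6)
dds <- makeExampleDESeqDataSet(n = 100, m = 12)

# Hypothetical second variable, alternating F/M so it is crossed with condition
dds$sex <- factor(rep(c("F", "M"), times = 6))

# Main effects plus an interaction: does the condition effect differ by sex?
design(dds) <- ~ sex + condition + sex:condition
dds <- DESeq(dds)

# List the fitted coefficients; the interaction appears as its own term
coef_names <- resultsNames(dds)
print(coef_names)

# Condition effect in the reference sex (F)
res_F <- results(dds, contrast = c("condition", "B", "A"))

# Interaction term: is the condition effect different in M compared with F?
res_interaction <- results(dds, name = "sexM.conditionB")

# Condition effect in M = main effect + interaction term
res_M <- results(dds, contrast = list(c("condition_B_vs_A", "sexM.conditionB")))
```

With real data, check resultsNames(dds) first, since the coefficient names depend on your column names and factor levels.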

Hope this helps.

P.S. Once variables are concatenated into groups like age_sex in the metadata, a heatmap or PCA of the transformed counts (vst or rlog from DESeq2), coloured by those groups, is also a good way to see overall patterns.
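A PCA like that could be sketched roughly as follows. This again uses simulated counts from makeExampleDESeqDataSet() rather than real data, and the `age_sex` labels are invented for the example, so treat it as a starting point only.

```r
library(DESeq2)

set.seed(1)
# Simulated counts: 2000 genes, 12 samples, condition = A (x6) then B (x6)
dds <- makeExampleDESeqDataSet(n = 2000, m = 12)

# Hypothetical concatenated grouping column
dds$age_sex <- factor(rep(c("under50_F", "under50_M", "over50_F", "over50_M"),
                          times = 3))

# Variance-stabilising transformation; blind = TRUE ignores the design,
# which suits exploratory plots
vsd <- vst(dds, blind = TRUE)

# Return the PCA coordinates instead of a plot so they can be inspected
# or passed to ggplot2
pca_df <- plotPCA(vsd, intgroup = c("condition", "age_sex"), returnData = TRUE)
head(pca_df)

# Or draw the default plot, coloured by the combined groups
plotPCA(vsd, intgroup = c("condition", "age_sex"))
```

If samples separate by age_sex more strongly than by condition, that is a hint those variables need to be in the design.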