RNAseq of primary tumor and metastases in two different organ
0
0
Entering edit mode
8 months ago

Hello,

I am doing RNAseq of primary tumor and metastases in bone and lung, but I am still confused in how unravel the DEGs exclusive for one organ and vice versa. For example, how should I find the top 25 genes exclusive for bone and the top 25 genes exclusive for lung and plot them in heatmap in order to visualize in a single heatmap these 50 genes (25 for bone and 25 for lung) for the three different conditions (primary, bone and lung)?

Do to this I did a super long code, that I don't know if is correct or not. Can you tell me if this is correct and if there is a shorter way to do what I want to do?

Thank you so much in avdance.

Here my code so far:

dds <- DESeqDataSetFromMatrix(countData=RawCounts,
                           colData=sampleTable,
                           design= ~ Tropism)
dds <- dds[rowSums(counts(dds)) >= 10, ]
colData(dds)$Tropism <- relevel(colData(dds)$Tropism,
                                  ref="Breast")
dds <- estimateSizeFactors(dds)
dds <- DESeq(dds)

res.shr.bone <- lfcShrink(dds, coef = "Tropism_Bone_vs_Breast",type="apeglm")
res.shr.lung <- lfcShrink(dds, coef = "Tropism_Lung_vs_Breast",type="apeglm")
resFix.bone <- res.shr.bone[!is.na(res.shr.bone$padj),]
resFix.lung <- res.shr.lung[!is.na(res.shr.lung$padj),]

keys <- keys(org.Hs.eg.db)
s <- AnnotationDbi::select(org.Hs.eg.db,keys=keys,
                           columns=c("ENSEMBL","SYMBOL","ENTREZID"))
s <- s[!is.na(s$ENSEMBL),c("ENSEMBL","SYMBOL","ENTREZID")]

resDF.bone <- as.data.frame(resFix.bone)
resDF.lung <- as.data.frame(resFix.lung)
resDF.bone$ENSEMBL <- rownames(resDF.bone)
resDF.lung$ENSEMBL <- rownames(resDF.lung)

library(plyr)

resDF.bone <- join(resDF.bone,s,by=c("ENSEMBL"))
resDF.bone <- resDF.bone[!duplicated(resDF.bone$ENSEMBL),]
resDF.lung <- join(resDF.lung,s,by=c("ENSEMBL"))
resDF.lung <- resDF.lung[!duplicated(resDF.lung$ENSEMBL),]

# Here I take up and down gene for bone and lung:
resString.bone <- resDF.bone[abs(resDF.bone$log2FoldChange) >= 1 & 
                     (resDF.bone$padj < 0.01),]
resString.lung <- resDF.lung[abs(resDF.lung$log2FoldChange) >= 1 & 
                     (resDF.lung$padj < 0.01),]
resString.bone.up <- resDF.bone[resDF.bone$log2FoldChange >= 1 & 
                     (resDF.bone$padj < 0.01),]
resString.lung.up <- resDF.lung[resDF.lung$log2FoldChange >= 1 & 
                     (resDF.lung$padj < 0.01),]
resString.bone.down <- resDF.bone[resDF.bone$log2FoldChange <= -1 & 
                     (resDF.bone$padj < 0.01),]
resString.lung.down <- resDF.lung[resDF.lung$log2FoldChange <= -1 & 
                     (resDF.lung$padj < 0.01),]

# In this way I have up and down for each, then I do intersection to obtain shared one:
#Select Shared Genes
library(dplyr)

shared.up<- as.data.frame(intersect(resString.bone.up$SYMBOL, resString.lung.up$SYMBOL))
colnames(shared.up)[1] <- "SYMBOL"

shared.down<- as.data.frame(intersect(resString.bone.down$SYMBOL, resString.lung.down$SYMBOL))
colnames(shared.down)[1] <- "SYMBOL"

# Select exclusive genes for bone and lung
exclusive_bone <- anti_join(resString.bone, shared.up, by = 'SYMBOL')
exclusive_lung <- anti_join(resString.lung, shared.up, by = 'SYMBOL')
exclusive_bone <- anti_join(exclusive_bone, shared.down, by = 'SYMBOL')
exclusive_lung <- anti_join(exclusive_lung, shared.down, by = 'SYMBOL')

# And finally I do normalization and plotting:
#Get normalized counts for the top 25 variable genes
topGenesBone <- exclusive_bone[order(exclusive_bone$padj)[1:25], c("ENSEMBL", "SYMBOL")]
topGenesLung <- exclusive_lung[order(exclusive_lung$padj)[1:25], c("ENSEMBL", "SYMBOL")]
topGenes <- rbind(topGenesBone,topGenesLung)

normalized_counts_top <- counts(dds, normalized = T)[topGenes$ENSEMBL,]

#Identify the samples corresponding to the reference group "Breast" "Primary"
reference_group_samples <- colnames(dds)[dds$Tropism == "Breast" & dds$Condition == "Primary"]

#Calculate mean expression for each gene in the reference group
mean_reference_group <- rowMeans(normalized_counts_top[, reference_group_samples])

#Calculate log2FC for each gene relative to the mean of the reference group
log2FC_top <- normalized_counts_top

for (sample in colnames(log2FC_top)) {
  log2FC_top[, sample] <- log2(normalized_counts_top[, sample] / mean_reference_group)
}

rownames(log2FC_top) <- topGenes$SYMBOL

#Customize the appearance of the heatmap
    pheatmap(log2FC_top,
         annotation_legend = TRUE,
         cluster_rows = TRUE,
         cluster_cols = FALSE,
         show_colnames = TRUE,
         color = colorRampPalette(c("navy", "white", "firebrick3"))(100),  
         fontsize = 10,  
         border_color = "white", 
         main = "Top 50 Differentially Expressed Genes",
         angle_col = '45)
metastases RNA-seq DEG R • 1.5k views
ADD COMMENT
0
Entering edit mode

Can you please make the question clear? Most people will not go through this wall of text and unformatted code (10101 button is your friend). Just show the experimental design and groups, and then one can tell you how to code that up in DESeq2.

ADD REPLY
1
Entering edit mode

I've formatted their post for them but making the question clear is all up to them.

ADD REPLY
0
Entering edit mode

Sorry if the question was not clear. I have three condition (Breast, Bone and Lung) and I want to know which DEG are exclusive for bone and which one for lung, since Breast is my reference group. And once I have these exclusive DEGs I want to plot the top 25 per condition (bone and lung) in the same heatmap. Here the design of the sample table

sampleTable <- data.frame(
  SampleID,
  Condition = c("Metastasis","Metastasis","Metastasis","Metastasis","Metastasis","Metastasis","Metastasis","Metastasis", "Primary", "Primary", "Primary", "Primary"),
  Replicate = c(1, 2,3,4, 1, 2,3,4,1,2,3,4),
  Tropism = c("Bone", "Bone","Bone","Bone","Lung", "Lung", "Lung","Lung","Breast","Breast","Breast","Breast")
)
ADD REPLY
0
Entering edit mode

Not that Im an authority on this but surely before comparing primary v mets you should be comparing normal v cancerous in each tissue to give a baseline. Then compare primary v mets, to see how the baseline has changed?

Otherwise, you are comparing 2 altered conditions (primary + mets), without any negative control (non-cancerous) so the differences in gene expression seen may be the result of differences naturally found between those tissues before cancer was present.

Stupid analagy:

  • Experiment: What is the difference between crashing a bike into a fence or a swimming pool?
  • Results: Fences turn crashed bikes blue, swimming pools turn crashed bicycles green
  • Mistake: 1 bike was blue, the other red, before the crash.

First:

  • breast normal v breast primary --> breast_P_DEGs
  • bone normal v bone mets --> bone_M_DEGs
  • lung normal v lung mets --> lung_M_DEGs

Second

  • breast_P_DEGs v bone_M_DEGs
  • breast_P_DEGs v lung_M_DEGs

Just an idea. Could be way off.

ADD REPLY
0
Entering edit mode

Thank you very much for the answer, however I only care about expression in the cancer cells, since my data are coming from xenograft models.

However, what I need help for is make a for cycle that for each gene present in resString.bone check if the same gene in resDF.lung has a log2foldchange between -0.2 and 0.2, and if it is true I want to save it in a new dataframe. Thank you so much

ADD REPLY
1
Entering edit mode

Most advise to avoid For Loops in R (don't ask me why, unexpected results and behaviours, poor efficiency I think). I would use tidyverse functions. Something along the lines of what you are already doing

# Load Library
library(dplyr)

# Sample tibble
my_tib <- tibble(w = c(NA, 4, 7, 2, 9),
                 x = c(1,3,3,3,5),
                 y = c(92,3,3,3,10),
                 z = c(27,12,13,12,14))
my_tib

# Select only rows that meet the conditions
results_tib <- my_tib |> filter((x == 1 | x == 3) & y == 3 & z > 12)

Also, it is my preference to have one large object containing all data with an additional column for tissue. so that I can extract from the same object each time to avoid confusion. Then rather than creating a separate object for each tissue, I can consider grouping by tissue. In this case the grouping isn't doing much but it might be useful for other transformations.

# extract all genes that pass threshold for fdr and logfoldchange
my_data |> 
   group_by(tissue) |>
   filter(fdr < 0.5 & log2foldchange %in% (-0.2:0.2))
ADD REPLY
0
Entering edit mode

Yeah, I heard that for cycle are not super suggested in R. I can't use tibble since my bone df has a different size than lung df.. However the 2 conditions are super similar so I would like to filter it with more stringent parameters. can you help me with the for cycle?

ADD REPLY
0
Entering edit mode

Really not sure how to tackle it with a for loop. Probably needs a nested for loops with nested if loops as well (no thanks!). I suggest posting on stackoverflow a simple representation of the problem if you insist on using a for loop

You could also post on bioconductor if you'd like to try solving it with dplyr (recommended).

Regarding the different sized dataframes, I'm assuming you mean one or the other has more rows or columns, in which case a full join will work.

library(dplyr)
bone <- tibble(gene = c("a", "b", "c"), count = c(1, 10, 3)) # 2 cols, 3 rows
lung <- tibble(gene = c("a", "b"), count = c(1.1, 5), something_else = c("blue", "white")) # 3 cols, 2 rows
bone_and_lung <- full_join(bone, lung, join_by(gene))
bone_and_lung |> 
  mutate(l2fc = log2(count.y/count.x)) |>
  filter(l2fc > -0.2 & l2fc < 0.2)
ADD REPLY
0
Entering edit mode

Moreover, I see I have several genes with pvalue smaller than 10^-300, is this fine? Should I set I threshold?

ADD REPLY
1
Entering edit mode

P-values are not corrected for the multiple comparisons inherent in differential gene expression analysis. FDR (aka qvalue) is used for selecting deferentially expressed genes (i.e. genes where FDR<0.05). Bear in mind these values are nothing more than selection cut-offs. There is nothing inherently right about 0.05 - or any other cutoff. The cutoff is arbitrary. With noisy data with few results, the cut-off has to be more generous and the results will need additional evidence.

ADD REPLY
0
Entering edit mode

on top of this I have many genes with a padj equal to zero, is this coming from an error or it's just that the value is too small for RStudio? How should I resolve it?

ADD REPLY
1
Entering edit mode

That's normal. Anything lower than the lower limit of R's numerical value range is rounded to 0. No resolution is required. Just be careful if you are plotting a volcano plot. I think 0's are ignored... but I'm not sure about this.

See Jared Andrew's answer on the same query

ADD REPLY
0
Entering edit mode

Please do not add answers unless you're answering the top level question. Instead, use Add Comment or Add Reply as appropriate. I've moved your post to the right location this time, please be more careful in the future.

ADD REPLY

Login before adding your answer.

Traffic: 1985 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6