Question

Can we treat outliers as batch variables in linear modeling?

3 months ago

aUser ▴ 70

Hi everyone,

Can we treat outliers as a batch variable in linear modeling, e.g. in DESeq2? I know that outliers and batches are different things; however, can I argue that the outlier samples were "processed differently" and therefore qualify as a separate batch? I do not want to remove the outliers (defined from the PCA as PC1 > 200; the actual values are around 600 along PC1, and there are ~20 such samples). I want to include them in the DEG calculation. I looked for resources where this has been discussed, but maybe I missed them.

Thank you for your input/comment.

R modeling Outlier linear • 387 views

Can you provide more context, and also show your PCA? Were there replicates of each sample?

I'm not sure I follow your logic. Outliers in linear models are individual samples that deviate from expected distributions. If the outliers were processed in a different manner or came from the same day of sampling, for example, then there could be a technical batch effect.
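To make the distinction concrete, one could encode the PCA-defined group as a covariate and check whether the resulting design is even estimable. The sketch below uses base R only with simulated sample annotations; the column name `pcGroup` and the sample counts are assumptions for illustration and are not taken from the actual data. A full-rank design only means a formula such as `~ pcGroup + sType` can be fitted; it does not show that the flagged samples reflect a technical effect.

```r
# Hypothetical sample annotation: sType as in the thread, pcGroup is an
# assumed flag marking samples beyond a PC1 cutoff (all of them tumors here).
coldata <- data.frame(
  sType   = factor(rep(c("NT", "TP"), times = c(10, 30))),
  pcGroup = factor(c(rep("main", 30), rep("outlier", 10)))
)

# Cross-tabulate the flag against the condition of interest. If flagged
# samples fall into only one condition, the flag is partly confounded with it.
tab <- table(coldata$pcGroup, coldata$sType)
print(tab)

# Build the design matrix and check that it is full rank, i.e. that every
# coefficient can be estimated.
design <- model.matrix(~ pcGroup + sType, data = coldata)
full_rank <- qr(design)$rank == ncol(design)
print(full_rank)
```

If the flag were perfectly aligned with `sType` (for example, every tumor flagged and no normal flagged), the design would lose rank and the two effects could not be separated.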

3 months ago by dthorbur ★ 2.6k

Thank you for your response, and sorry for the late reply; we were on vacation here.

I am working with the TCGA-LUAD data set, and the samples are processed/normalized using DESeq2. The steps are given below:

library(DESeq2)

# Build the DESeq2 data set from raw counts, modeling sample type (NT vs TP)
ld_dds <- DESeqDataSetFromMatrix(countData = ld_dataPrep2,
                                 colData = ld_sampleTypes,
                                 design = ~sType)
ld_dds <- DESeq(ld_dds)

# Extract normalized counts for clustering and immune infiltration analysis
ld_normCount <- counts(ld_dds, normalized = TRUE)
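For readers unfamiliar with what `normalized = TRUE` returns: the counts are divided by per-sample size factors, which DESeq2 estimates by default with the median-of-ratios method. Below is a minimal base R sketch of that idea on a small simulated matrix; the matrix and variable names are made up for illustration, and this is not a replacement for `estimateSizeFactors()`.

```r
# Simulated raw count matrix: 10 genes (rows) x 4 samples (columns).
set.seed(7)
cnt <- matrix(rpois(40, lambda = 50), nrow = 10)

# Per-gene log geometric mean across samples (the pseudo-reference sample).
log_geomeans <- rowMeans(log(cnt))

# Size factor per sample: median ratio of its counts to the reference,
# using only genes with a finite reference and a non-zero count.
size_factors <- apply(cnt, 2, function(col) {
  ok <- is.finite(log_geomeans) & col > 0
  exp(median((log(col) - log_geomeans)[ok]))
})

# Normalized counts: each column divided by its size factor.
norm_cnt <- sweep(cnt, 2, size_factors, "/")
print(round(size_factors, 3))
```

Note that this normalization only corrects for sequencing depth and composition; it does not stabilize the variance, which matters for the PCA step.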

For PCA:

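library(ggplot2)

# Keep genes with a total normalized count above 100 across all samples
ld_normCount <- ld_normCount[rowSums(ld_normCount) > 100, ]

# PCA on samples (rows after transposing), with each gene scaled to unit variance
pca.obj <- prcomp(t(ld_normCount),
                  scale. = TRUE)

pcr.objx <- as.data.frame(pca.obj$x)

# Keep the first three principal components
dtp <- data.frame('titles' = rownames(pcr.objx),
                  pcr.objx[, 1:3])

# Attach sample types by matching on the row names of ld_sampleTypes
dtp2 <- merge(dtp, ld_sampleTypes, by.x = "titles", by.y = 0)

#print(head(dtp2))
ggplot(data = dtp2) +
    geom_point(aes(x = PC1, y = PC2, col = sType)) + # colour points by sample type
    theme_minimal() +
    labs(title = "LUAD PCA")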

Samples with PC1 > 200 are considered outliers (as suggested by the literature).
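As a rough illustration of how such a cutoff can be turned into a per-sample label, here is a self-contained base R sketch on simulated counts. The data, the `log2(x + 1)` transform (an addition, not part of the steps above), the cutoff `pc1_cutoff`, and the column name `pcGroup` are all assumptions for the example; the value 200 only makes sense on the scale of the real PCA.

```r
# Simulated stand-in for a normalized count matrix (genes x samples).
set.seed(42)
n_genes <- 500
n_samples <- 30
counts_sim <- matrix(rnbinom(n_genes * n_samples, mu = 200, size = 5),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)),
                                     paste0("s", seq_len(n_samples))))

# Drop genes with no variation so that scaling inside prcomp() is defined.
keep <- apply(counts_sim, 1, var) > 0
pca_sim <- prcomp(t(log2(counts_sim[keep, ] + 1)), scale. = TRUE)

# Flag samples beyond an assumed cutoff on PC1 (hypothetical value).
pc1_cutoff <- 2 * sd(pca_sim$x[, "PC1"])
scores <- as.data.frame(pca_sim$x[, 1:3])
scores$pcGroup <- factor(ifelse(scores$PC1 > pc1_cutoff, "outlier", "main"),
                         levels = c("main", "outlier"))
print(table(scores$pcGroup))
```

Keep in mind that the sign of a principal component is arbitrary, so a one-sided cutoff such as `PC1 > 200` is tied to one particular PCA run.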

The PCA figure is attached. NT are normal samples, while TP are tumor samples. [Image: PCA_image]

3 months ago by aUser ▴ 70