Entering edit mode
4 months ago
Mgs3
▴
30
I performed an LDA analysis on my data, and I want to extract the variables that most impacted the ordination. If I understood correctly, I can obtain that information from the "scaling" component of the LDA object. In the script below, I use lda_variable_impact <- as.data.frame(novaseqmod$scaling). Can someone confirm if this is correct?
Additional information: The input file "full" contains samples in columns and bacterial genus in rows, with the percentage count of bacteria in each sample.The "metadata" file associate each sample to a condition
This is my code:
full[1:4,1:4] 1
Bacteria;Acidobacteriota;Blastocatellia;Blastocatellales;Blastocatellaceae;Stenotrophobacter 0.14613673
Bacteria;Planctomycetota;Planctomycetes;Gemmatales;Gemmataceae;Fimbriiglobus 0.09421973
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Comamonadaceae;Ideonella 0.00000000
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Rhodobacter 0.02412305
2
Bacteria;Acidobacteriota;Blastocatellia;Blastocatellales;Blastocatellaceae;Stenotrophobacter 0.13213317
Bacteria;Planctomycetota;Planctomycetes;Gemmatales;Gemmataceae;Fimbriiglobus 0.04552392
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Comamonadaceae;Ideonella 0.02481054
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Rhodobacter 0.03015960
3
Bacteria;Acidobacteriota;Blastocatellia;Blastocatellales;Blastocatellaceae;Stenotrophobacter 0.06054529
Bacteria;Planctomycetota;Planctomycetes;Gemmatales;Gemmataceae;Fimbriiglobus 0.14525729
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Comamonadaceae;Ideonella 0.01825357
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Rhodobacter 0.03187948
4
Bacteria;Acidobacteriota;Blastocatellia;Blastocatellales;Blastocatellaceae;Stenotrophobacter 0.09844844
Bacteria;Planctomycetota;Planctomycetes;Gemmatales;Gemmataceae;Fimbriiglobus 0.02838332
Bacteria;Proteobacteria;Gammaproteobacteria;Burkholderiales;Comamonadaceae;Ideonella 0.00000000
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodobacterales;Rhodobacteraceae;Rhodobacter 0.04247574
# LDA
nrep <- 1000
ldares <- data.frame(novaseq = rep(NA, nrep))
for (aaa in 1:nrep) {
cat("Replicate", aaa, "out of", nrep, "\n")
# Scale for LDA
scalenovaseq <- data.frame(scale(t(full)))
scalenovaseq$type <- metadata[, myvar][match(row.names(scalenovaseq), metadata$Sample_name)]
scalenovaseq <- scalenovaseq[order(as.numeric(row.names(scalenovaseq))), ]
# Create training set and test set
chooseme <- sample(c(TRUE, FALSE), nrow(scalenovaseq), replace = TRUE, prob = c(0.7, 0.3))
trainnovaseq <- scalenovaseq[chooseme, ]
testnovaseq <- scalenovaseq[!chooseme, ]
# Remove variables that are not variable within condition
sdnovaseq <- apply(trainnovaseq[, !names(trainnovaseq) %in% "type"], 2, sd)
removeme <- sdnovaseq < 0.01
trainnovaseq <- trainnovaseq[, !removeme]
testnovaseq <- testnovaseq[, !removeme]
# Create LDA model
novaseqmod <- lda(type ~ ., data = trainnovaseq)
novaseqpredicted <- predict(novaseqmod, testnovaseq)
ldares[aaa, "novaseq"] <- mean(novaseqpredicted$class == testnovaseq$type)
}
# Calculate explained variance for each linear discriminant
explained_variance <- novaseqmod$svd^2 / sum(novaseqmod$svd^2)
# Extract and save LDA loadings
lda_variable_impact <- as.data.frame(novaseqmod$scaling)
cross posted: https://stackoverflow.com/questions/78777739