I want to find candidate genes for 20 different chemical compounds. I am using TPM data for 50 cultivars and have a matrix of TPM values for each gene across all 50 cultivars, where A1 to A6 are genes and A86 to A60 are cultivars:
gene A86 A90 A99 A16 A09 A60
A1 0 0.4 0 0 0 0
A2 0 0 0 0 0 0
A3 0.5 0 0 0.42 0 0
A4 0 0 0 0 0 0
A5 0 0 0 0 0 0
A6 0 0 0 0 0 0
I have a concentration dataset for each chemical compound, like this:
Cultivar Compound_X
A86 20.5
A90 5.6
A99 7.1
A16 12
A09 1.5
A60 9.9
I have TPM values for all cultivars, but for each chemical compound the concentration values are missing for some cultivars. I want to use a standard linear regression in R to find candidate genes for each chemical compound based on their p-values.
for (gene in 1:ngenes){
model = lm(Compound_X~TPM[gene,])
}
I want to extract the p-value from each linear regression and save the p-values to a vector, one per gene, for each chemical compound, so that I can find candidate genes. Thank you!
You can find the p-values using the summary function in R: s <- summary(lm(volatile ~ TPM[gene, ])). The p-values are stored in the coefficients component, e.g. s$coefficients.
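For reference, here is a minimal sketch of the full loop on the toy data from the question. It assumes the TPM values are in a numeric matrix with genes as row names and cultivars as column names, and that the concentrations are in a data frame with columns Cultivar and Compound_X. The object names (tpm, conc, pvals) are only illustrative. Genes whose expression is constant across the cultivars used are skipped, because no slope can be estimated for them.

```r
# Toy data copied from the question (genes in rows, cultivars in columns)
tpm <- rbind(
  A1 = c(0,   0.4, 0, 0,    0, 0),
  A2 = c(0,   0,   0, 0,    0, 0),
  A3 = c(0.5, 0,   0, 0.42, 0, 0),
  A4 = c(0,   0,   0, 0,    0, 0),
  A5 = c(0,   0,   0, 0,    0, 0),
  A6 = c(0,   0,   0, 0,    0, 0)
)
colnames(tpm) <- c("A86", "A90", "A99", "A16", "A09", "A60")

conc <- data.frame(
  Cultivar   = c("A86", "A90", "A99", "A16", "A09", "A60"),
  Compound_X = c(20.5, 5.6, 7.1, 12, 1.5, 9.9)
)

# Keep only cultivars with a measured concentration, and put both
# datasets in the same cultivar order
conc   <- conc[!is.na(conc$Compound_X), ]
shared <- intersect(conc$Cultivar, colnames(tpm))
y      <- conc$Compound_X[match(shared, conc$Cultivar)]
x_all  <- tpm[, shared, drop = FALSE]

# One p-value per gene; NA where the model cannot be fitted
pvals <- setNames(rep(NA_real_, nrow(x_all)), rownames(x_all))
for (gene in seq_len(nrow(x_all))) {
  x <- x_all[gene, ]
  if (length(unique(x)) < 2) next  # constant expression, no slope to test
  fit <- summary(lm(y ~ x))
  pvals[gene] <- fit$coefficients["x", "Pr(>|t|)"]
}

pvals
```

For 20 compounds, the same loop could be repeated once per compound column, storing each pvals vector as one column of a gene-by-compound matrix. With that many tests, adjusting the p-values, for example with p.adjust(pvals, method = "BH"), is probably worth considering.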
Thank you! I am actually not able to run lm yet. I want to run it in a for loop, as I mentioned in the question. I have the two datasets mentioned above and want to perform the lm step; your suggestion will be helpful after that. Is there a way to drop the genes that have zero TPM for all cultivars?
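One way to drop the all-zero genes, sketched on the toy matrix from the question: since TPM values are never negative, a row sum of zero means the gene is zero in every cultivar. The names tpm and tpm_expressed are only illustrative.

```r
# Toy data copied from the question (genes in rows, cultivars in columns)
tpm <- rbind(
  A1 = c(0,   0.4, 0, 0,    0, 0),
  A2 = c(0,   0,   0, 0,    0, 0),
  A3 = c(0.5, 0,   0, 0.42, 0, 0),
  A4 = c(0,   0,   0, 0,    0, 0),
  A5 = c(0,   0,   0, 0,    0, 0),
  A6 = c(0,   0,   0, 0,    0, 0)
)
colnames(tpm) <- c("A86", "A90", "A99", "A16", "A09", "A60")

# TPM is non-negative, so a zero row sum means zero in every cultivar
keep <- rowSums(tpm) > 0

# drop = FALSE keeps the result a matrix even if only one gene remains
tpm_expressed <- tpm[keep, , drop = FALSE]

rownames(tpm_expressed)
```

If the real matrix contains NA values, rowSums(tpm, na.rm = TRUE) may be needed instead.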
Please also note that raw TPM values are not normally distributed, so you should not use lm directly on them. Log2-transform them first (remember to add a pseudocount of your choice).
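A minimal sketch of that transformation, using two genes from the toy matrix in the question. The pseudocount of 1 is only an assumption for illustration; it is a common choice, but nothing in this thread prescribes it.

```r
# Two genes from the toy data in the question
tpm <- rbind(
  A1 = c(0,   0.4, 0, 0,    0, 0),
  A3 = c(0.5, 0,   0, 0.42, 0, 0)
)
colnames(tpm) <- c("A86", "A90", "A99", "A16", "A09", "A60")

# Assumption: pseudocount of 1, so that a TPM of 0 maps to log2(1) = 0
pseudocount <- 1

# log2 is applied element-wise; dimensions and dimnames are preserved
log_tpm <- log2(tpm + pseudocount)

log_tpm
```

The transformed matrix can then be used in place of the raw TPM matrix as the predictor in lm.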