Perplexed about the process of gene expression normalization
9 months ago
DS ▴ 10

Hello everyone,

I have a gene expression matrix consisting of 10 genes (rows) and 300 samples (columns). My objective is to prepare this dataset for eQTL analysis.

It is necessary to convert the data into a standard normal distribution when conducting tests on my linear mixed model.

Currently, I am working with TPM as my starting point. Here are the steps I plan to take:

  1. Apply a log transformation to TPM+1; I intend to perform this operation on a per-gene basis.

  2. Either perform Quantile Normalization or Median Normalization - I believe it would be best to apply this step on a per-sample basis.

  3. Standardize each gene by scaling it so that the mean equals 0 and the standard deviation equals 1.
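Steps 1 and 3 can be sketched as follows. This is a minimal illustration with NumPy, assuming a genes x samples array of TPM values; the function name `log_and_standardize` and the simulated data are hypothetical. Note that the log is element-wise, so it gives the same result whether it is thought of as per-gene or per-sample, and that z-scoring fixes the mean and standard deviation but does not by itself make the values normally distributed.

```python
import numpy as np

def log_and_standardize(tpm):
    """Return per-gene z-scores of log2(TPM + 1) for a genes x samples array."""
    logged = np.log2(np.asarray(tpm, dtype=float) + 1.0)  # element-wise log transform
    mean = logged.mean(axis=1, keepdims=True)              # per-gene mean across samples
    sd = logged.std(axis=1, ddof=1, keepdims=True)         # per-gene sample standard deviation
    return (logged - mean) / sd

# Simulated stand-in for a 10-gene x 300-sample TPM matrix (not real data)
rng = np.random.default_rng(0)
tpm = rng.gamma(shape=2.0, scale=50.0, size=(10, 300))
z = log_and_standardize(tpm)
```

Each row of `z` then has mean 0 and standard deviation 1 across the 300 samples.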

Can someone confirm if my thought process is correct? Any other comments are welcome.

gene RNA-seq expression normalization • 651 views
9 months ago

I don't know what the norm is in the eQTL field. If you are going to normalise on a per-sample basis, you almost certainly need to do it with more than 10 genes. In particular, you definitely can't do Quantile Normalisation with only 10 data points in each sample.

Because in eQTL you are comparing the same gene across many samples, I would argue for not using TPM. Instead, if the data are available, I would start with raw counts and then apply DESeq2's rlog transformation, which will take care of both normalisation and transformation to something approaching a normal distribution. rlog can be slow on large datasets (you'll still need all the genes to do the normalisation). I've not experimented with the VST transform, but that also makes gene counts homoskedastic, on a log scale, and works faster than rlog.
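To illustrate why per-sample normalisation needs the full gene set, here is a rough NumPy sketch of median-of-ratios size factors followed by a plain log2 transform. This is not DESeq2's rlog or VST, only a simplified stand-in for the normalisation idea; the function name and the simulated counts are hypothetical.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample from a genes x samples count matrix."""
    counts = np.asarray(counts, dtype=float)
    keep = (counts > 0).all(axis=1)                   # use genes with no zero counts
    log_counts = np.log(counts[keep])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)  # per-gene log geometric mean
    # The median over many genes makes the estimate robust; 10 genes would be too few
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Simulated counts: 2000 genes, 4 samples sequenced at different depths (not real data)
rng = np.random.default_rng(1)
base = rng.gamma(2.0, 100.0, size=(2000, 1))
depth = np.array([0.5, 1.0, 2.0, 4.0])
counts = rng.poisson(base * depth)

sf = median_of_ratios_size_factors(counts)
log_norm = np.log2(counts / sf + 1.0)                 # depth-corrected, log-scale values
```

The size factors track sequencing depth, and the 10 genes of interest would then be taken from `log_norm` after normalising on all genes.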


Raw counts are available! I will check the proposed method. Thank you for your input!

9 months ago
LChart 4.5k

Is "10" genes a typo? Maybe "10K"? Quantile and Median normalization are probably not going to do what you expect if you only have 10 genes.

For (2), these normalization steps are only defined to operate on a per-sample basis. Quantile normalization (in point of fact CQN) is commonly applied for eQTLs (and commonly skipped, as it can inadvertently deflate values for genes that have a low-frequency eQTL).

Sample outlier exclusion via PCA is also commonly applied - I don't see that step here.

Also, some kind of correction for known covariates (sex, age, RIN, sequencing depth) and unknown sources of variation (PEER, RUV, SVA, etc.) is usually applied for eQTLs - I don't see that here either. You may be intending to include these in your mixed-effect model -- but do compute and include something like PEER factors.

Z-scaling is typical and usually occurs twice (once before covariate correction, and again afterwards -- the second rescaling is not strictly necessary since it only impacts eQTL effect sizes but not p-values).
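The point about the second rescaling can be checked with a small simulation. This is a sketch under simplifying assumptions: ordinary least squares instead of a mixed model, covariates regressed out of expression only, and made-up data. The helper names (`zscore`, `residualize`, `slope_and_t`) are hypothetical.

```python
import numpy as np

def zscore(x):
    return (x - x.mean()) / x.std(ddof=1)

def residualize(y, covariates):
    """Remove the least-squares fit of an intercept plus covariates from y."""
    X = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

def slope_and_t(y, g):
    """Simple linear regression of y on g: return the slope and its t-statistic."""
    g_c, y_c = g - g.mean(), y - y.mean()
    beta = (g_c @ y_c) / (g_c @ g_c)
    res = y_c - beta * g_c
    se = np.sqrt((res @ res) / (len(y) - 2) / (g_c @ g_c))
    return beta, beta / se

rng = np.random.default_rng(2)
n = 300
covariates = rng.normal(size=(n, 3))                  # stand-ins for e.g. age, RIN, a PEER factor
genotype = rng.integers(0, 3, size=n).astype(float)   # 0/1/2 allele dosage
expr = 0.4 * genotype + covariates @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)

resid = residualize(zscore(expr), covariates)         # first z-scaling, then covariate correction
resid_z = zscore(resid)                               # second z-scaling

beta_raw, t_raw = slope_and_t(resid, genotype)
beta_z, t_z = slope_and_t(resid_z, genotype)
```

Here `beta_z` equals `beta_raw` divided by the standard deviation of `resid`, while `t_raw` and `t_z` are identical, so the p-value is unaffected.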


It's not a typo; I am doing a very specific experiment. Good to know, then, that QN is not an appropriate method to use.

Up until now, I have obtained my results using log(TPM+1). However, during the subsequent analysis, I discovered that my beta estimates are not standardized compared to other experiments I have performed. Therefore, I must return to the normalization process to ensure that my genes follow a standard normal distribution.

For the eQTL mapping, I am indeed generating PEER factors and accounting for sex, age, kinship, and other filters applied to the genotypes, etc. Visual inspection for sample exclusion is a good idea; I was thinking of setting a threshold defined as three interquartile ranges above the upper quartile or below the lower quartile!
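A threshold like that could look roughly as follows. This is a sketch that assumes one summary value per sample (for example a score on the first principal component); the function name `iqr_outliers` and the simulated scores are hypothetical.

```python
import numpy as np

def iqr_outliers(values, k=3.0):
    """Flag values more than k IQRs below the lower or above the upper quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Simulated per-sample scores with two planted outliers (not real data)
rng = np.random.default_rng(3)
scores = rng.normal(size=300)
scores[[5, 42]] = [12.0, -15.0]

flagged = np.flatnonzero(iqr_outliers(scores))  # indices of samples to inspect or exclude
```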

Thanks for sharing your thoughts!
