Hello,
I have 5 RNA-seq samples (with 3 repetition for each sample) from different cell lines that are not related to each other. My goal is to analyze gene expression, but I am unsure about the best approach to design the analysis. Since I don't have control samples as a reference for the comparison (e.g., healthy vs. cancer cells), I am focusing on the expression levels of each gene in each sample individually.
If I only have one sample, how should I interpret the results? How can I determine if a gene is highly expressed? Additionally, if I have only one condition (one sample without any reference sample), how should I create the design matrix?
If I had experiments with a few conditions that I wanted to compare to each other, I would do something like this:
group = factor(sampleinfo$condition)
y <- DGEList(count_data, group=group)
condition <- factor(sampleinfo$condition)
design <- model.matrix(~ 0 + condition)
yNorm <- calcNormFactors(y)
But I can't do model.matrix
with only one sample.
Would it make sense to perform gene expression analysis on all samples together and normalize them with 'between samples' normalization methods, even though they are completely unrelated? This way, I could compile them into one table and identify the top 20 genes in each sample.
I would greatly appreciate any help!!!
Thank you for your response! I failed to mention that I have 3 replicates per sample. If I want to analyze just one sample (with 3 replicates) at a time, how can I determine if a gene is highly or lowly expressed without any baseline (no control)?
The 5 samples were simply sequenced together, and I don't need to compare them. Maybe mentioning them initially added unnecessary complexity to my question.
I'm more confused about designing the analysis than the technical aspects.
Are you interested in genes enriched specifically in each cell type? Or do you really just want a ranked list for each sample?
If you really only care about gene expression within the sample (ranked list), it can be more simple analysis. Generate transcript-length-normalized counts (RPK, RPKM or TPM). This will give you read density per gene. Then put all values in a table to look at genes with highest density as most expressed genes.
Between samples normalization is not necessary, but if you do RPKM or TPM, that is a form of between samples normalization anyways.
I'm not great with stats, but you can probably get fancy with tests to call "statistically enriched" genes based on the variance from your replicates. You are would be looking for genes with higher than average expression, so significantly different from the average.
You can also consider Z-score. Using length-normalized counts, you subtract the average count of all genes from the counts for each gene and divide by the standard deviation. This will give you a stat of how many far from the mean a gene is (e.g. a value of 1 = 1 standard deviation away from the mean). Then plot a heatmap to find genes with consistently high zscore.
(Xgene - MEANallgenes) / SDallgenes
Many of the most highly expressed genes will be basic housekeeping genes that many people don't find interesting, so you can consider filtering these out, maybe by determining the common genes in each cell sample.
Thank you for your answer!!! I think ranked list might work. Do I need to use calcNormFactors(y) to normalize the library size between the replicas, and then do RPKM(y)? I think RPKM and TPM scales the count data by gene length, so it doesn't adjust the library size.