Dear Everyone,
I am quite new to RNA-seq data analysis (and statistics), and I would like to ask for help, suggestions and personal experience with the GTEx dataset. My apologies in advance if this question is redundant; when I read through some forums, I did not find satisfactory explanations or suggestions.
For my work, I am planning to use the publicly available GTEx data from human tissues. I would like to compare RNA-seq data from certain tissues between healthy and diseased samples. My aim is to find differentially expressed genes (DEGs) and to make additional comparisons later on. So far:
- I downloaded the latest GTEx raw read count and TPM matrices.
- From the raw read count matrix, I extracted the columns that match my criteria (selected tissues, sex, disease, etc.).
- As a pilot run, I selected 5 healthy (HE) and 5 diseased (DIS) samples (preferably those with the highest RNA quality (RIN) values) and ran a DESeq2 analysis.
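To make the selection step concrete, here is a minimal Python sketch of how such a subset could be built. It is an illustration under assumptions, not the exact procedure used: `SAMPID`, `SMTSD` and `SMRIN` follow the GTEx sample attribute naming, while the `condition` column, the toy rows, the toy counts and both function names are hypothetical.

```python
# Toy annotation rows. SAMPID, SMTSD and SMRIN follow GTEx sample attribute
# naming; "condition" (HE/DIS) is a hypothetical column added for illustration.
samples = [
    {"SAMPID": "S1", "SMTSD": "Liver", "SMRIN": 8.1, "condition": "HE"},
    {"SAMPID": "S2", "SMTSD": "Liver", "SMRIN": 6.5, "condition": "HE"},
    {"SAMPID": "S3", "SMTSD": "Liver", "SMRIN": 7.4, "condition": "HE"},
    {"SAMPID": "S4", "SMTSD": "Liver", "SMRIN": 7.9, "condition": "DIS"},
    {"SAMPID": "S5", "SMTSD": "Liver", "SMRIN": 6.0, "condition": "DIS"},
    {"SAMPID": "S6", "SMTSD": "Lung", "SMRIN": 9.0, "condition": "HE"},
]

# Toy raw count table: gene -> {sample ID: raw read count}.
counts = {
    "GENE_A": {"S1": 10, "S2": 12, "S3": 9, "S4": 30, "S5": 28, "S6": 5},
    "GENE_B": {"S1": 0, "S2": 1, "S3": 0, "S4": 2, "S5": 1, "S6": 7},
}


def top_rin_samples(rows, tissue, condition, n):
    """Return the IDs of the n highest-RIN samples for one tissue and condition."""
    matching = [
        r for r in rows
        if r["SMTSD"] == tissue and r["condition"] == condition
    ]
    matching.sort(key=lambda r: r["SMRIN"], reverse=True)
    return [r["SAMPID"] for r in matching[:n]]


def subset_counts(count_table, sample_ids):
    """Keep only the chosen sample columns of the count table."""
    return {
        gene: {s: row[s] for s in sample_ids}
        for gene, row in count_table.items()
    }


he_ids = top_rin_samples(samples, "Liver", "HE", 2)
dis_ids = top_rin_samples(samples, "Liver", "DIS", 2)
pilot_counts = subset_counts(counts, he_ids + dis_ids)
```

The resulting `pilot_counts` table, together with the HE/DIS labels, is the kind of input that would then be passed on to the differential expression step.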
As expected, the PCA plot showed quite large inter-sample variation, but surprisingly the HE and DIS samples were also not well separated. In addition, I got only a few significant DEGs, and these did not match previously reported gene expression profiles.
We thought that we could find more DEGs by pre-selecting more "similar" samples for each condition. To find them, I generated separate PCA plots from all HE samples and from all DIS samples.
Based on these PCA plots, I selected 10 samples per condition that showed the smallest distances from each other on both PCs. After running DESeq2 on 10 vs 10 samples, the PCA plot again showed large inter-sample variation, also within each condition, and the analysis returned zero significant DEGs.
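One way to express this kind of PCA-based pre-selection in code is sketched below in Python. It assumes the PC1/PC2 coordinates have already been computed elsewhere, and it uses "closest to the group centroid" as a stand-in for "small distances from each other", which may differ from how the samples were actually picked. The function name and the toy coordinates are hypothetical.

```python
import math


def closest_to_centroid(pc_coords, k):
    """Return the k sample IDs nearest to the group centroid in PC1/PC2 space.

    pc_coords maps sample ID -> (PC1, PC2), assumed to be precomputed.
    """
    xs = [pc[0] for pc in pc_coords.values()]
    ys = [pc[1] for pc in pc_coords.values()]
    cx = sum(xs) / len(xs)
    cy = sum(ys) / len(ys)
    # Rank samples by Euclidean distance to the centroid, nearest first.
    ranked = sorted(
        pc_coords,
        key=lambda s: math.hypot(pc_coords[s][0] - cx, pc_coords[s][1] - cy),
    )
    return ranked[:k]


# Toy coordinates for one condition; S4 is a deliberate outlier.
he_pcs = {
    "S1": (0.0, 0.0),
    "S2": (1.0, 0.0),
    "S3": (0.0, 1.0),
    "S4": (10.0, 10.0),
}
selected = closest_to_centroid(he_pcs, 3)
```

This only looks at the first two components, so samples that appear close here can still differ along later PCs.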
I also tried the same samples with limma-voom, but ended up with zero DEGs again.
After this experience, I would like to ask the following questions:
- Can we use large public datasets like GTEx for DEG analyses?
- (If yes) How many samples per condition are optimal for finding DEGs while keeping the inter-sample variation manageable?
- Is there any good method, pipeline, tool, etc. for finding DEGs when we have many samples per condition and batch effects?
- Is it a good idea to pre-select samples that are "more similar" (grouping together) based on a two-dimensional PCA plot?
- Does anyone else have similar experience with GTEx?
- Bonus question: is it possible that some samples are mixed up in the GTEx database?
Looking forward to your responses. And thank you in advance for any suggestions and comments.
Are you only comparing samples within the GTEx dataset or are you also adding other datasets to the analysis?
Yes, I am comparing only GTEx samples from the same tissues. I am not planning to add other datasets.