6 months ago
Jacek
I want to analyze the expression of a specific cluster of genes of interest across different tissues. All samples come from SRA data tables. However, some tissues have 4 replicates, while others have only 2 or 3. Do you think I can still proceed with my analysis? It is really difficult for me to find 4 replicates for each tissue of interest. Please help me.
I am a newbie in transcriptomic analysis and try to grasp anything I can. I really appreciate your answer. Have a good day.
Having a varying number of replicates is not ideal (see here why), but manageable. See for example the Specific experimental designs section (3.) in the edgeR vignette for possible experimental designs that match your use case. That document also explains how to detect differentially expressed genes starting from count matrices. Besides edgeR, DESeq2 is another software option.
What worries me more is that you say some samples are from SRA data tables. This suggests that you haven't created the data yourself and that the samples may have been processed entirely differently: different library preparation method, different sequencing platform, etc. This constitutes a batch effect and, with regard to the tissue, also a linear dependency problem. Because of the perfect confounding, it is impossible to say whether a detected difference is attributable to the different tissue or to the different processing.
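The linear dependency can be made concrete with a small sketch. It assumes a hypothetical layout with two tissues (A with 3 replicates, B with 4) and two labs; the sample layouts and the helper names `matrix_rank` and `design_matrix` are illustrative only and not taken from edgeR or DESeq2. When every tissue comes from its own lab, the tissue and lab columns of the design matrix are identical, so the matrix loses rank and the two effects cannot be estimated separately. The unequal replicate numbers alone do not cause this.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all other rows.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(samples):
    """Columns: intercept, indicator for tissue B, indicator for lab2."""
    return [[1, int(tissue == "B"), int(lab == "lab2")] for tissue, lab in samples]


# Hypothetical, perfectly confounded layout: each tissue comes from exactly one lab.
confounded = [("A", "lab1")] * 3 + [("B", "lab2")] * 4

# Hypothetical crossed layout: both labs contributed both tissues (still unbalanced).
crossed = ([("A", "lab1")] * 2 + [("A", "lab2")] * 1
           + [("B", "lab1")] * 2 + [("B", "lab2")] * 2)

rank_confounded = matrix_rank(design_matrix(confounded))
rank_crossed = matrix_rank(design_matrix(crossed))

print("confounded design rank:", rank_confounded, "of 3 columns")
print("crossed design rank:   ", rank_crossed, "of 3 columns")
```

In the confounded layout the rank is 2 although the model has 3 columns, which is the situation where a tissue effect and a lab effect cannot be told apart. In the crossed layout all 3 columns are estimable despite the unequal group sizes.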
Thank you for your reply, Zepper. A bit of clarification: I took all the samples from the SRA data table, but yes, they come from different labs and were sequenced on different instruments. Thanks.
Could you clarify what exactly you mean by "SRA data table"? If you are retrieving fastq files (raw data), then you could possibly work with them; however, if it is processed data, and from different labs and different sequencers at that, then you should not consider proceeding.
It's important to note that, if you have access to the fastq files, you should process the samples separately based on the instrument used, as every instrument follows different protocols. Do not try to process them all together; otherwise, you will end up with either errors during processing or messed-up results.
It's raw data (fastq.gz). For example, I have 4 samples from different tissues: sample A with 3 replicates, and samples B, C and D with 4 replicates each. Do you mean I should process them one by one, separately? Thank you.
It is still unclear to me how many different tissues you would like to include in your comparison and what your intended design is.
To illustrate the issues we pointed out here, let's make a specific example: You are interested in the gene expression of a particular gene in hematopoiesis.
In SRA, you find raw data from a study that performed RNA-seq on lymphoid progenitors and compared them to hemocytoblasts. You find a second study in SRA, from a different lab that was interested in macrophage biology, which generated RNA-seq data from the myeloid lineage and included hemocytoblasts as a control as well.
Now, it may be tempting to pool the hemocytoblast samples as replicates and treat them as one. However, mind that although they are the same cell type, they differ in terms of isolation, handling, RNA extraction, library preparation, sequencing, etc., so they are neither biological nor technical replicates.
For this reason, you can't just reprocess the lymphoid progenitor samples and the macrophage samples from the two different studies together, because any differences you detect may be due to the different lineage (~ tissue difference) or due to one of the many other variables named above. You will simply never be able to tell, and that is what I meant by the linear dependency problem.
But most of those other factors are the same within each of the studies. What you can do is call the differentially expressed genes separately, that is, the lymphoid cells vs. their matched controls on the one hand and the myeloid samples vs. their respective controls on the other hand, and later intersect the DE gene lists. This is not as clean and precise as using samples that all come from the same study, but at least it is not totally off.
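A minimal sketch of that intersection step, assuming each study has already been analyzed on its own (for example with edgeR or DESeq2) and the results were exported as gene, log2 fold change and adjusted p-value. The gene names, the numbers and the helper `significant_genes` are made up for illustration, and the 0.05 cutoff and the check on the direction of change are assumptions on my side, not part of the suggestion above.

```python
def significant_genes(results, alpha=0.05):
    """Map each gene passing the adjusted p-value cutoff to its direction of change."""
    return {
        gene: ("up" if log2_fc > 0 else "down")
        for gene, (log2_fc, padj) in results.items()
        if padj < alpha
    }


# Hypothetical per-study results: gene -> (log2 fold change vs. matched control, adjusted p-value)
lymphoid_vs_control = {
    "GENE1": (2.1, 0.001),
    "GENE2": (-1.4, 0.02),
    "GENE3": (0.3, 0.60),
    "GENE4": (1.8, 0.01),
}
myeloid_vs_control = {
    "GENE1": (1.7, 0.004),
    "GENE2": (1.1, 0.03),
    "GENE3": (0.2, 0.70),
    "GENE5": (-2.0, 0.001),
}

sig_lymphoid = significant_genes(lymphoid_vs_control)
sig_myeloid = significant_genes(myeloid_vs_control)

# Genes called differentially expressed in both within-study comparisons.
shared = sorted(sig_lymphoid.keys() & sig_myeloid.keys())

# Split the shared genes by whether the direction of change agrees.
concordant = [g for g in shared if sig_lymphoid[g] == sig_myeloid[g]]
discordant = [g for g in shared if sig_lymphoid[g] != sig_myeloid[g]]

print("DE in both studies:", shared)
print("same direction:    ", concordant)
print("opposite direction:", discordant)
```

Each comparison stays within one study, so the study-specific handling cancels out against the matched controls before the lists are compared.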
That being said, there are several expression reference catalogs out there, in which gene expression was measured in a plethora of tissues in a highly standardized fashion. I recommend using that data rather than devising your own strategy, in particular if you are only interested in the expression of a particular gene cluster, but across many tissues.
Thank you so much, Matthias, this is really helpful. I gained a lot of insight from this. Have a nice day.
It makes absolutely no sense to conduct such an analysis. I strongly discourage you from even starting it. The results will be utter nonsense for the aforementioned reasons.