Hi all,
I hope you're well. I want to perform a tissue-specific expression analysis using multiple tissues. I have lots of raw FASTQ files that I retrieved from the ENCODE database. After successfully completing the alignment and quantification steps, I need to come up with consensus data for each tissue. I don't know how to do this yet, but I want to ask about a further step in the analysis. When comparing groups in DESeq2, we normally use a control group as the reference. In this case, though, I don't have a baseline reference against which I can compare all tissues. Can I still use DESeq2 for this purpose? If it's possible, how exactly? If not, is there another method you can suggest? I'm new to the topic and a little confused about it. I would appreciate any help.
Thanks in advance!
Can I ask what exactly you mean by "same experiment"? If you mean the same donor and the same procedure, then unfortunately not all of the data comes from the same experiment. Two different labs produced these data. One produced only single-end reads, and the other only paired-end reads. The donors differ between the labs' experiments, and the experiment IDs are different for most of the FASTQ files, so they are the products of different sequencing runs. What can be done in these circumstances? Can I treat them as replicates, or should I filter them?
Sounds like full confounding. Be careful. Personally, I would not do this comparison. There is no way to distinguish the tissue effect from the donor effect and the batch effect. I know it is tempting, but one cannot collect data at random and pretend it came from one experiment. RNA-seq is a relative measure, and the baselines simply differ between experiments.
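To make the confounding point concrete, here is a minimal sketch of a check you could run on the sample metadata before fitting any model. It links tissues that share a batch, directly or through a chain of batches. Under a simple additive tissue + batch model, tissues that end up in different groups cannot be compared without the batch difference being mixed in. The function name, the `(tissue, batch)` input format, and the example labels are all hypothetical, and "batch" here could stand for lab, donor, or sequencing run.

```python
from collections import defaultdict


def confounded_groups(samples):
    """Group tissues that are linked through shared batches.

    samples: iterable of (tissue, batch) pairs, one pair per sample.
    Returns a list of sets of tissue names. Tissues in different sets never
    share a batch, even indirectly, so a tissue difference between sets
    cannot be separated from a batch difference.
    """
    parent = {}

    def find(node):
        # Follow parent links to the representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for tissue, batch in samples:
        t_node, b_node = ("tissue", tissue), ("batch", batch)
        parent.setdefault(t_node, t_node)
        parent.setdefault(b_node, b_node)
        # A sample connects its tissue to its batch.
        parent[find(t_node)] = find(b_node)

    groups = defaultdict(set)
    for kind, name in list(parent):
        if kind == "tissue":
            groups[find((kind, name))].add(name)
    return sorted(groups.values(), key=sorted)


# Hypothetical metadata: each lab sequenced a disjoint set of tissues.
example = [
    ("liver", "labA"), ("heart", "labA"),
    ("lung", "labB"), ("kidney", "labB"),
]
groups = confounded_groups(example)
```

With this example, liver and heart land in one group and lung and kidney in another, so a liver versus lung difference could equally be a labA versus labB difference. A single group is necessary for separating the effects, but it does not guarantee that the batch effect is well estimated.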
I wrote an R package (see rrdr.io, and GitHub) some years ago to calculate tissue specificity from bulk RNA-seq based on this paper. Bear in mind this was before scRNA-seq became so popular. While I did not use DESeq2 as part of the pipeline, it may give you some ideas and code snippets for how to proceed, as I also included ENCODE data (23 tissues).
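For a feel of what a tissue-specificity score looks like, here is a minimal sketch of the tau index, one commonly used measure. This is an assumption chosen for illustration and not necessarily what the package or the paper above uses. The function name is hypothetical, and the input is assumed to be one gene's non-negative expression value per tissue (for example, the mean of normalized, log-scaled replicates).

```python
def tau_index(expression):
    """Tau specificity score for one gene.

    expression: list of non-negative expression values, one per tissue.
    Returns a value from 0 (equal in all tissues) to 1 (one tissue only),
    or None if the gene is not expressed in any tissue.
    """
    n_tissues = len(expression)
    if n_tissues < 2:
        raise ValueError("need at least two tissues")
    if min(expression) < 0:
        raise ValueError("expression values must be non-negative")
    peak = max(expression)
    if peak == 0:
        return None  # score is undefined for an unexpressed gene
    # Sum how far each tissue falls below the peak, scaled to the peak.
    return sum(1 - value / peak for value in expression) / (n_tissues - 1)


broad = tau_index([5.0, 5.0, 5.0])      # same level everywhere
specific = tau_index([0.0, 0.0, 10.0])  # expressed in one tissue only
```

Here `broad` is 0.0 and `specific` is 1.0. The score is only meaningful if the per-tissue values are on a comparable scale.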
Unfortunately, I put the cart before the horse. The pipeline assumes all samples are comparable. I have confidence in the robustness of the pipeline, but I have yet to figure out how to ensure the count data is comparable across so many samples from different experiments. This is a massive batch effect problem.
The main problem is as ATpoint describes: you can be absolutely sure there will be confounding effects across the experiments. I'm still not sure how to tackle this. It might work if you can find a dataset from a single lab that covers multiple tissues and comes with metadata describing the data collection (i.e. when, who, how). The more of this kind of detail you can add to the metadata for every tissue, the better your chance of finding batch effects and creating something awesome.