Hello, I'm trying to analyze public RNA-seq data obtained from the GEO database, but I'm not sure how to handle some of the data.
A representative dataset is GSE88945 (PRJNA349164). It has only three samples but 47 data files: many files share the same GSM ID but have different SRR IDs.
Here are the first header lines of several FASTQ files with the same GSM ID:
Filename
> Header line
SRR4432915_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432915.1 HWI-ST959:164:C2KV4ACXX:6:1101:1172:2037/1
SRR4432916_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432916.1 DF9F08P1:223:D2F1AACXX:7:2115:13996:65512/1
SRR4432917_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432917.1 DF9F08P1:223:D2F1AACXX:7:2304:18001:55398/1
SRR4432918_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432918.1 DF9F08P1:223:D2F1AACXX:8:1101:1402:2235/1
In this case, the headers differ at three points: 1) HWI-ST959 vs. DF9F08P1, 2) 7 vs. 8, 3) 2115 vs. 2304.
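For reference, the part of each header after the SRR read name appears to follow the seven-field Illumina read-name layout (instrument, run number, flowcell ID, lane, tile, x, y), followed by a `/1` mate suffix. Under that assumption, the first field is the sequencing instrument, the fourth is the flowcell lane, and the tile and x/y fields are the position of an individual read cluster, so they change from read to read within any file. A minimal Python sketch for splitting the headers into named fields is shown here; `IlluminaRead` and `parse_header` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class IlluminaRead(NamedTuple):
    """Fields of one FASTQ header, assuming the seven-field Illumina layout."""
    run_accession: str  # SRA run, e.g. SRR4432915
    instrument: str
    run_number: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    mate: str           # "1" or "2", or "" if there is no /N suffix


def parse_header(line: str) -> IlluminaRead:
    """Parse a header such as
    '@SRR4432915.1 HWI-ST959:164:C2KV4ACXX:6:1101:1172:2037/1'."""
    sra_id, illumina = line.strip().lstrip("@").split(maxsplit=1)
    name, _, mate = illumina.partition("/")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    return IlluminaRead(
        sra_id.split(".")[0], instrument, int(run), flowcell,
        int(lane), int(tile), int(x), int(y), mate,
    )


# First header line of each file, copied from the listing above.
headers = [
    "@SRR4432915.1 HWI-ST959:164:C2KV4ACXX:6:1101:1172:2037/1",
    "@SRR4432916.1 DF9F08P1:223:D2F1AACXX:7:2115:13996:65512/1",
    "@SRR4432917.1 DF9F08P1:223:D2F1AACXX:7:2304:18001:55398/1",
    "@SRR4432918.1 DF9F08P1:223:D2F1AACXX:8:1101:1402:2235/1",
]

for header in headers:
    read = parse_header(header)
    # Instrument, flowcell and lane are the fields that describe the run itself.
    print(read.run_accession, read.instrument, read.flowcell, read.lane)
```

Read this way, the first file also differs from the other three in run number (164 vs. 223) and flowcell ID (C2KV4ACXX vs. D2F1AACXX), not only in instrument.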
My questions can be summarized as follows: what causes these differences between the FASTQ files, and can I merge the data or not? If anybody knows, please advise me.
Thanks in advance.
Thank you for your answer! You said "merge each replicates", and I would like to understand that point in more detail. Within ONE sample (in this case, H_G3), the FASTQ files differ from one another, and I'm confused about which files can reasonably be merged and which cannot. As you say, is it okay to merge all the FASTQ files for a sample?
May I ask what your plan is after merging the files? These are mRNA expression data.
Oh, I'm sorry for my late reply. I'm planning to process those RNA-seq data and then perform co-expression network analysis (after integrating other datasets), so I'm preprocessing the data now. But I'm unsure whether I can simply merge all the data from one sample.
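If the runs under one GSM ID turn out to be repeated sequencing of the same library (an assumption that should be checked against the SRA run metadata first), merging them per sample could be sketched as below. The sketch groups files by GSM ID and mate number, orders them by SRR number so that the `_1` and `_2` files stay in step, and concatenates the gzip files byte for byte, which works because a sequence of gzip streams is itself a valid gzip file. The filename pattern is inferred from the example names in the question, and `group_runs` and `merge_gzip` are hypothetical helper names.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme, taken from the examples in the question:
#   <SRR>_<GSM>_<description>_<mate>.fastq.gz
PATTERN = re.compile(r"^(SRR\d+)_(GSM\d+)_.*_([12])\.fastq\.gz$")


def group_runs(filenames):
    """Map (GSM ID, mate) to its files, ordered by SRR number."""
    groups = defaultdict(list)
    for name in filenames:
        match = PATTERN.match(Path(name).name)
        if match is None:
            continue  # skip anything that does not follow the assumed pattern
        srr, gsm, mate = match.groups()
        groups[(gsm, mate)].append((int(srr[3:]), name))
    # Sorting by SRR number gives mate 1 and mate 2 the same run order.
    return {key: [name for _, name in sorted(runs)]
            for key, runs in groups.items()}


def merge_gzip(sources, destination):
    """Concatenate gzip files byte for byte into one gzip file."""
    with open(destination, "wb") as out:
        for source in sources:
            with open(source, "rb") as handle:
                shutil.copyfileobj(handle, out)


# Grouping only; no files are read or written here.
example = [
    "SRR4432916_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz",
    "SRR4432915_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz",
]
for (gsm, mate), files in group_runs(example).items():
    print(gsm, mate, files)
```

To write the merged files, each group would be passed to `merge_gzip`, for example `merge_gzip(files, f"{gsm}_{mate}.fastq.gz")`. Single-end files without a `_1`/`_2` suffix do not match the assumed pattern and would need a different one.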