Question

Multiple RNA-seq fastq files for one sample in GEO database

21 months ago

maximal_life ▴ 20

Hello, I'm trying to analyze public RNA-seq data from the GEO database, but I'm not sure how to handle some of it.

A representative dataset is GSE88945 (PRJNA349164). It has only three samples but 47 data files: many files share the same GSM ID and differ only in their SRR ID.

Here are the first header lines from several fastq files with the same GSM ID:

Filename
> Header line

SRR4432915_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432915.1 HWI-ST959:164:C2KV4ACXX:6:1101:1172:2037/1

SRR4432916_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432916.1 DF9F08P1:223:D2F1AACXX:7:2115:13996:65512/1

SRR4432917_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432917.1 DF9F08P1:223:D2F1AACXX:7:2304:18001:55398/1

SRR4432918_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
> @SRR4432918.1 DF9F08P1:223:D2F1AACXX:8:1101:1402:2235/1

In this case, the headers differ in three places: 1) HWI-ST959 vs. DF9F08P1, 2) 7 vs. 8, 3) 2115 vs. 2304.

My questions come down to two things: what causes these differences between the fastq files, and can I merge the data or not? If anyone knows, please advise.
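
One way to compare such headers field by field is to split the read name on colons. The sketch below assumes the common Illumina read-name layout `instrument:run:flowcell:lane:tile:x:y`; that layout is an assumption about these files, not something stated in the GEO record, and `parse_read_header` is a made-up helper name.

```shell
#!/bin/sh
# Split a FASTQ header as dumped from SRA into its (assumed) Illumina fields.
# Assumed layout after the SRA prefix: instrument:run:flowcell:lane:tile:x:y/mate
parse_read_header() (
    name=${1#* }      # drop the leading "@SRRxxxx.N " prefix (up to first space)
    name=${name%/*}   # drop the trailing "/1" or "/2" mate suffix
    IFS=: read -r instrument run flowcell lane tile x y <<EOF
$name
EOF
    printf 'instrument=%s run=%s flowcell=%s lane=%s tile=%s\n' \
        "$instrument" "$run" "$flowcell" "$lane" "$tile"
)

# Example:
# parse_read_header '@SRR4432915.1 HWI-ST959:164:C2KV4ACXX:6:1101:1172:2037/1'
```

Under that assumption, the first difference would be the instrument name, the second the flowcell lane, and the third the tile within the lane.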

Thanks in advance.

GEO RNA-seq multiplefiles • 1.2k views

score 0 · Answer 1 · 2023-08-24

They have three samples: H_G3 (16 runs), H_G5 (13 runs) and H_G14 (18 runs), 47 runs in total. See below:

esearch -db bioproject -query "PRJNA349164" | elink -target sra | efetch -format runinfo | cut -d "," -f1,30
# Note: the SamplesGroup column was added by hand after running the command above
Run SampleName  SamplesGroup
SRR4432915  GSM2355695  H_G3
SRR4432916  GSM2355695  H_G3
SRR4432917  GSM2355695  H_G3
SRR4432918  GSM2355695  H_G3
SRR4432919  GSM2355695  H_G3
SRR4432920  GSM2355695  H_G3
SRR4432921  GSM2355695  H_G3
SRR4432922  GSM2355695  H_G3
SRR4432923  GSM2355695  H_G3
SRR4432924  GSM2355695  H_G3
SRR4432925  GSM2355695  H_G3
SRR4432926  GSM2355695  H_G3
SRR4432927  GSM2355695  H_G3
SRR4432928  GSM2355695  H_G3
SRR4432929  GSM2355695  H_G3
SRR4432930  GSM2355695  H_G3
SRR4432931  GSM2355696  H_G5
SRR4432932  GSM2355696  H_G5
SRR4432933  GSM2355696  H_G5
SRR4432934  GSM2355696  H_G5
SRR4432935  GSM2355696  H_G5
SRR4432936  GSM2355696  H_G5
SRR4432937  GSM2355696  H_G5
SRR4432938  GSM2355696  H_G5
SRR4432939  GSM2355696  H_G5
SRR4432940  GSM2355696  H_G5
SRR4432941  GSM2355696  H_G5
SRR4432942  GSM2355696  H_G5
SRR4432943  GSM2355696  H_G5
SRR4432944  GSM2355697  H_G14
SRR4432945  GSM2355697  H_G14
SRR4432946  GSM2355697  H_G14
SRR4432947  GSM2355697  H_G14
SRR4432948  GSM2355697  H_G14
SRR4432949  GSM2355697  H_G14
SRR4432950  GSM2355697  H_G14
SRR4432951  GSM2355697  H_G14
SRR4432952  GSM2355697  H_G14
SRR4432953  GSM2355697  H_G14
SRR4432954  GSM2355697  H_G14
SRR4432955  GSM2355697  H_G14
SRR4432956  GSM2355697  H_G14
SRR4432957  GSM2355697  H_G14
SRR4432958  GSM2355697  H_G14
SRR4432959  GSM2355697  H_G14
SRR4432960  GSM2355697  H_G14
SRR4432961  GSM2355697  H_G14
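
The run counts per sample can be tallied from the run table rather than by eye. This is a minimal sketch, assuming two-column comma-separated input (`Run,SampleName`) with one header line, such as the raw output of the command above; `count_runs_per_sample` is a made-up helper name.

```shell
#!/bin/sh
# Count SRA runs per sample from "Run,SampleName" CSV on stdin.
# The first line is treated as a header; empty sample names are skipped.
count_runs_per_sample() {
    awk -F',' 'NR > 1 && $2 != "" { n[$2]++ }
               END { for (s in n) print s, n[s] }' | sort
}

# Example (requires Entrez Direct, as in the command above):
# esearch -db bioproject -query "PRJNA349164" | elink -target sra |
#     efetch -format runinfo | cut -d "," -f1,30 | count_runs_per_sample
```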

You can merge the runs of each sample or analyze them separately; it depends on what you want to achieve. Also, it looks like they have provided both raw and normalized data for this study.
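
If you do decide to merge, gzip files can simply be concatenated, so one merged file per sample and mate can be built with `cat`. The sketch below assumes file names of the form `SRR*_GSM*_..._1.fastq.gz` / `..._2.fastq.gz`, as in the question; `merge_runs_by_gsm` and the output naming are made up for illustration.

```shell
#!/bin/sh
# Concatenate all per-run FASTQ files that share a GSM ID into one file per mate.
# Assumes names like SRR4432915_GSM2355695_H_G3_Homo_sapiens_RNA-Seq_1.fastq.gz
merge_runs_by_gsm() (
    indir=$1
    outdir=$2
    mkdir -p "$outdir"
    # Collect the distinct GSM IDs that appear in the file names.
    gsms=$(for f in "$indir"/SRR*_GSM*.fastq.gz; do basename "$f"; done |
           grep -o 'GSM[0-9][0-9]*' | sort -u)
    for gsm in $gsms; do
        for mate in 1 2; do
            # The glob expands in sorted order, so mate 1 and mate 2
            # are concatenated in the same run order.
            set -- "$indir"/SRR*_"${gsm}"_*_"${mate}".fastq.gz
            [ -e "$1" ] || continue   # no files for this mate
            cat "$@" > "$outdir/${gsm}_${mate}.fastq.gz"
        done
    done
)

# Example: merge_runs_by_gsm ./fastq ./merged
```

Merging like this treats every run of a GSM as the same library, so if you want to check for lane or flowcell effects first, analyze the runs separately before combining them.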