I have a question about a dataset I would like to download from the GEO repository.
The data set is this one - GSE114882
I would like to download the fastq files using fasterq-dump from sra-tools. This works nicely, but now I am trying to understand the naming and I have some questions.
In the Run Selector there are 44 runs in total, all with single-end layout, but the experiment contains only 22 samples. (This also matches the number of GEO_Accessions in the corresponding column, one for every two rows.) Does this mean that GEO has split each sample into two? Can they simply be concatenated?
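To make the "two rows per GEO accession" pattern explicit, here is a minimal Python sketch that groups runs by sample from a Run Selector metadata export. The column names `Run` and `GEO_Accession (exp)` are assumptions about how the exported table is labelled, the function name is made up for illustration, and the two example rows use accessions from this dataset.

```python
import csv
import io
from collections import defaultdict


def runs_per_sample(table_text, sample_col="GEO_Accession (exp)", run_col="Run"):
    """Map each GEO sample accession to the list of SRA runs listed for it.

    The default column names are assumptions; adjust them to match the
    header of the table you actually exported.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_text)):
        groups[row[sample_col]].append(row[run_col])
    return dict(groups)


# Two example rows only; a real export has many more rows and columns.
example = """Run,GEO_Accession (exp)
SRR7214386,GSM3152879
SRR7214387,GSM3152879
"""

groups = runs_per_sample(example)
for sample, runs in groups.items():
    print(sample, len(runs), runs)
```

If every accession comes back with exactly two runs, that confirms the 44 runs / 22 samples structure, but it says nothing yet about whether the two runs are safe to merge.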
thanks for the help
Assa
thanks for the fast answer. This is exactly my question though.
Let's take, for example, the first two rows with the GEO accession number GSM3152879. This accession is linked to two SRR numbers, SRR7214386 and SRR7214387. When I click on the details, I can see that the fastq files are named T1-X_S11_L003_R1_001.fastq.gz and T1-X_S36_L006_R1_001.fastq.gz.
T1-X is the condition. It looks like they were run on two different lanes of the (same???) flow cell. Can I then assume these are two technical replicates and combine them into one sample for downstream analysis?
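As a small illustration of what can be read from those file names, here is a Python sketch that splits them into sample name, sample number, lane and read. The regular expression assumes the usual `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz` layout; the function name is made up for this example.

```python
import re

# Assumed layout: <sample>_S<number>_L<lane>_R<read>_001.fastq.gz
ILLUMINA_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<number>\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Split an Illumina-style fastq file name into its labelled parts."""
    match = ILLUMINA_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name layout: {name}")
    return {
        "sample": match["sample"],
        "sample_number": int(match["number"]),
        "lane": int(match["lane"]),
        "read": int(match["read"]),
    }


first = parse_fastq_name("T1-X_S11_L003_R1_001.fastq.gz")
second = parse_fastq_name("T1-X_S36_L006_R1_001.fastq.gz")

# Same sample name, different lanes (3 and 6) and different S numbers (11 and 36).
print(first)
print(second)
```

The names show the same sample on lanes 3 and 6, but nothing in them identifies the flow cell, so they cannot settle the "same flow cell" question on their own.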
thanks
If you can verify that they are from separate lanes of the same flow cell (check the flow cell serial number in the fastq header and see if it matches), then you can treat them as technical sequencing replicates.
If the serial is different, they may still be the same library (the sample name seems to be the same in the example above) run twice on two different flow cells, but you may need to ask the submitters to confirm that.
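A rough Python sketch of that header check, assuming the reads still carry original Illumina headers in the common `@instrument:run:flowcell:lane:tile:x:y` layout. The two headers below are invented placeholders, not reads from this dataset, and the function name is hypothetical.

```python
def flowcell_and_lane(header):
    """Return (flowcell_id, lane) from an Illumina-style read header.

    Assumes the colon-separated layout
    @instrument:run:flowcell:lane:tile:x:y [optional comment].
    """
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"not an Illumina-style header: {header}")
    return fields[2], int(fields[3])


# Invented placeholder headers, for illustration only.
header_a = "@INSTR1:12:FLOWCELLA:3:1101:1000:2000 1:N:0:ACGT"
header_b = "@INSTR1:12:FLOWCELLA:6:1101:1500:2500 1:N:0:ACGT"

flowcell_a, lane_a = flowcell_and_lane(header_a)
flowcell_b, lane_b = flowcell_and_lane(header_b)

same_flowcell = flowcell_a == flowcell_b
print("same flow cell:", same_flowcell, "lanes:", lane_a, lane_b)
```

If the headers have been rewritten during submission or download, the function raises a `ValueError` and the check cannot be done from the files alone.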
Yes, I also thought about it, but unfortunately GEO changes the header to:
So I guess I'll have to ask the people who submitted the samples.
No original Illumina headers, so yes, that would be the prudent thing to do.