Need help understanding GEO data structure
2.0 years ago
Assa Yeroslaviz ★ 1.9k

I have a question about a dataset I would like to download from the GEO repository.

The data set is this one - GSE114882

I would like to download the FASTQ files using fasterq-dump from sra-tools. This works nicely, but I am now trying to understand the naming and have some questions.

In the Run Selector there are 44 runs in total, all with single-end layout, but the experiment contains only 22 samples. (This also matches the number of GEO_Accession values in the corresponding column: one for every two rows.) Does this mean that GEO has split each sample into two runs? Can they simply be concatenated?

Thanks for the help.

Assa

Run-Selector GEO gene-expression-omnibus • 1.1k views
2.0 years ago
GenoMax 147k

If you look at the SRA Run Browser it appears that each sample has been run twice (check Biosample and Experiment columns). https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA472972&o=acc_s%3Aa
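
A quick way to do this check outside the browser is to download the metadata table from the Run Selector and group the runs yourself. This is only a minimal sketch: the column names `Run`, `BioSample` and `Experiment` are assumed to match the downloaded table, and the accessions in the example are made-up placeholders, not rows from this study.

```python
import csv
import io
from collections import defaultdict


def runs_per_sample(table_text, key="Experiment", run_col="Run"):
    """Map each value of `key` to the list of run accessions that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(table_text)):
        groups[row[key]].append(row[run_col])
    return dict(groups)


# Placeholder rows for illustration only; a real table would come from
# the Run Selector metadata download (column names are an assumption).
example = """Run,BioSample,Experiment
SRR0000001,SAMN0000001,SRX0000001
SRR0000002,SAMN0000001,SRX0000001
SRR0000003,SAMN0000002,SRX0000002
"""

for experiment, runs in runs_per_sample(example).items():
    print(experiment, len(runs), runs)
```

Two runs under one Experiment and one BioSample only tell you that the submitters registered them as the same library; it does not by itself say how the sequencing was done.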

You will need to check the details to see whether these are independent libraries for each sample or technical sequencing replicates. In the latter case you can merge the files; in the former, use them as replicates.
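
If they do turn out to be technical sequencing replicates, merging is plain concatenation. Here is a rough sketch, assuming all inputs are of the same kind (all gzipped or all plain FASTQ); the function names are my own. Gzip files can be appended byte for byte because a gzip stream may contain several members.

```python
import gzip
import shutil


def concatenate_files(inputs, output):
    """Append the input files byte for byte into `output`.

    `output` must not be one of the inputs.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


# Hypothetical usage (file names are placeholders):
# concatenate_files(["run_a.fastq.gz", "run_b.fastq.gz"], "merged.fastq.gz")
# count_reads("merged.fastq.gz") should equal the sum of the two input counts.
```

Comparing the read count of the merged file with the sum of the inputs is a cheap sanity check before starting the downstream analysis.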


Thanks for the fast answer. This is exactly my question, though.

Let's take, for example, the first two rows with the GEO accession number GSM3152879. This accession is connected to the two SRR numbers SRR7214386 and SRR7214387. When I click on the details there, I can see that the fastq files are named T1-X_S11_L003_R1_001.fastq.gz and T1-X_S36_L006_R1_001.fastq.gz. T1-X is the condition. It looks like they were run on two different lanes of the (same???) flow cell.
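
For what it's worth, these names follow the usual Illumina pattern `SampleName_S<number>_L<lane>_R<read>_001.fastq.gz`, so they can be taken apart mechanically. A small sketch (the regular expression and the function name are mine, and it assumes the submitters kept the default naming):

```python
import re

# Default Illumina FASTQ naming: SampleName_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})"
    r"_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Split a default-style Illumina FASTQ file name into its fields.

    Returns None when the name does not follow the pattern.
    """
    match = FASTQ_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["sample_number"] = int(fields["sample_number"])
    fields["lane"] = int(fields["lane"])
    return fields


for name in ("T1-X_S11_L003_R1_001.fastq.gz", "T1-X_S36_L006_R1_001.fastq.gz"):
    print(parse_fastq_name(name))
```

Both files share the sample name T1-X but differ in lane (3 vs. 6) and in the S number (11 vs. 36). The file name says nothing about the flow cell, so this alone cannot tell me whether it was one run or two.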

Can I then assume these are two technical replicates and combine them into one sample for downstream analysis?

Thanks


If you can verify that they are from separate lanes of the same flowcell (check the flowcell serial number in the fastq header and see if it matches), then you can treat them as technical sequencing replicates.

If the serial number is different, they may still be the same library (the sample name seems to be the same in the example above) run twice on two different flowcells, but you may need to ask the submitters to confirm that.


Yes, I also thought about it, but unfortunately GEO changes the header to:

@SRR7214386.1 1 length=51
ANCACGTTCTAGCATTCAAGGTCCCCTGTAGGCACCATCAATAGATCGGAA
+SRR7214386.1 1 length=51
A#AFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR7214386.2 2 length=51

So I guess I'll have to ask the people who submitted the samples.


There are no original Illumina headers, so yes, that would be the prudent thing to do.
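
For anyone who does have files with the original headers, the flowcell check discussed above could look roughly like this. It is a sketch that assumes the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y` before the first space; the function name and the first example header are made up.

```python
def flowcell_and_lane(header):
    """Return (flowcell_id, lane) from an Illumina-style FASTQ header.

    Returns None when the header does not have the seven colon-separated
    fields expected before the first space.
    """
    if not header.startswith("@"):
        return None
    fields = header[1:].split(" ", 1)[0].split(":")
    if len(fields) != 7:
        return None
    flowcell, lane = fields[2], fields[3]
    if not lane.isdigit():
        return None
    return flowcell, int(lane)


# Made-up Illumina-style header, for illustration only.
print(flowcell_and_lane("@INSTR01:42:FLOWCELLX:3:1101:1000:2000 1:N:0:ATCACG"))

# Header as rewritten by SRA (taken from the post above): nothing to extract.
print(flowcell_and_lane("@SRR7214386.1 1 length=51"))
```

If the flowcell ID is identical for both files and only the lane differs, they are lanes of the same run; with the rewritten SRR headers the function simply returns None, which is why asking the submitters is the only option here.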
