Hello everyone,
I am working with data from PRJNA528920 and noticed that some BioSamples (SAMN) have multiple associated SRRs (Sequence Read Archive Runs). For example:
- SAMN11249717 = SRR8782083
- SAMN11249717 = SRR8782084
- SAMN11249716 = SRR8782085
- SAMN11249716 = SRR8782086
Additionally, I found a discrepancy between the number of samples reported in GSE128803 (which only lists 6 samples) and PRJNA528920, which contains 12 SRRs.
I read the associated paper but couldn’t find clear information about this. I also checked whether this could be related to the sequencing technology used (ION_TORRENT) but didn’t find any evidence suggesting so.
My questions are:
- Do these SRRs correspond to independent sequencing runs meant to select the highest-quality one?
- For alignment and count table generation, should I use only the first SRR for each BioSample?
- Is it possible to merge them without introducing batch effects?
I plan to use these data for my thesis, so I would really appreciate any guidance or experiences you can share on how to correctly process this type of data.
Thank you guys.
Hi, thank you!
I’ve already read the paper and reviewed the metadata, but I couldn’t find any additional relevant information.
Yes, there are a total of 6 samples: HCER_1,2,3 and HeLa_1,2,3. I also think these are two separate runs from the same library, or at least that’s what it seems...
However, I’m concerned about merging the FASTQ files due to potential batch effects. Would it be better to process everything separately?
I mean, should I generate count tables for all SRRs (samples) individually and then merge the counts later?
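To make the idea concrete, here is a rough Python sketch of "count per SRR, then merge the counts". The run-to-BioSample pairs are the four listed in the first post; the gene names, the count values, and the helper `sum_counts_by_sample` are made up for illustration and are not real data from the project.

```python
from collections import defaultdict

# Run -> BioSample pairs as listed in the first post.
RUN_TO_SAMPLE = {
    "SRR8782083": "SAMN11249717",
    "SRR8782084": "SAMN11249717",
    "SRR8782085": "SAMN11249716",
    "SRR8782086": "SAMN11249716",
}


def sum_counts_by_sample(run_counts, run_to_sample):
    """Add per-run gene counts together for runs that share a BioSample."""
    merged = defaultdict(lambda: defaultdict(int))
    for run, counts in run_counts.items():
        sample = run_to_sample[run]
        for gene, n in counts.items():
            merged[sample][gene] += n
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(genes) for sample, genes in merged.items()}


# Hypothetical per-run counts, invented purely for illustration.
toy_counts = {
    "SRR8782083": {"geneA": 10, "geneB": 0},
    "SRR8782084": {"geneA": 12, "geneB": 3},
    "SRR8782085": {"geneA": 7, "geneB": 5},
    "SRR8782086": {"geneA": 9, "geneB": 4},
}

per_sample = sum_counts_by_sample(toy_counts, RUN_TO_SAMPLE)
print(per_sample)
```

The result has one column of counts per BioSample, which is the shape a differential expression tool would expect if the two runs are treated as the same library.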
You could do that, but any variation you see between runs of the same sample will have no biological significance.
Technical replication of sequencing (for Illumina for sure, probably for Ion as well) generally shows minimal variation, so the data can be merged before processing.
Summing the per-run counts afterwards is no different from merging the data first and then aligning/counting, because each read is aligned independently.
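If you go the merge-first route, one way to do it is to concatenate the FASTQ files of the runs that belong to the same BioSample. Below is a minimal Python sketch, assuming single-end, gzip-compressed FASTQ files; the function name `concat_fastq_gz` and the file names in the usage comment are hypothetical. It relies on the fact that gzip files can be joined byte for byte and still decompress as one stream.

```python
import shutil
from pathlib import Path


def concat_fastq_gz(run_files, merged_file):
    """Concatenate gzip-compressed FASTQ files byte for byte.

    A file made of several gzip members is still a valid gzip file,
    so no decompression or recompression is needed.
    """
    with open(merged_file, "wb") as out:
        for path in run_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(merged_file)


# Hypothetical usage (file names are assumptions, one merged file per BioSample):
# concat_fastq_gz(
#     ["SRR8782083.fastq.gz", "SRR8782084.fastq.gz"],
#     "SAMN11249717.fastq.gz",
# )
```

The merged file can then be aligned and counted like any single-run sample.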