I am trying to download some public paired-end RNA-seq data, and I have found that some samples have the same GEO accession but different SRR numbers (and different sizes). Therefore, when I download them using sra-toolkit and fastq-dump --split-3, I get several files for the same sample.
As you can see in the following screenshot, most samples have their own SRR number and GEO_Accession. However, as I said, there are also some (highlighted) that share the same GEO_Accession but have different sizes and different SRR numbers.
When I use fastq-dump --split-3 for these samples (for example):
a) SRR7774397, I get:
SRR7774397_1.fastq
SRR7774397_2.fastq
b) SRR7774398, I get:
SRR7774398_1.fastq
SRR7774398_2.fastq
If you go to the NCBI Run Browser, each run appears as two fastq files (_1 and _2):
However, theoretically they belong to the same sample...
How do you usually download this type of data? It seems that the data for some samples is split across several runs, but I do not know how to merge them or, more generally, how to download them.
The SRA Run Selector page listing all the samples can be found here (PRJNA488803).
Any help is really appreciated.
Thanks very much in advance
Regards
My guess is that some samples were resequenced. I'd recommend just merging the respective R1s and R2s and treating them as one sample.
Thanks for your quick reply! How would you merge them?
You can cat the respective R1/R2 files in the same order, e.g. cat Run1_R1.fq.gz Run2_R1.fq.gz > R1.fq.gz, and likewise for R2.
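For the two runs mentioned in the question, the merge could look roughly like the sketch below. It assumes the uncompressed <run>_1.fastq and <run>_2.fastq files written by fastq-dump --split-3 are in the current directory. merge_runs is just a hypothetical helper name, and the output prefix is whatever sample label you choose. Both mates are concatenated in the same run order, so read pairs stay in sync.

```shell
#!/usr/bin/env bash
# Sketch: merge paired-end FASTQ files from several runs of one sample.
# merge_runs is a hypothetical helper, not part of sra-toolkit.
# Usage: merge_runs <output_prefix> <run1> <run2> ...
merge_runs() {
    local sample="$1"; shift
    local run mate

    # Refuse to start if any expected input file is missing.
    for run in "$@"; do
        for mate in 1 2; do
            if [ ! -f "${run}_${mate}.fastq" ]; then
                echo "missing ${run}_${mate}.fastq" >&2
                return 1
            fi
        done
    done

    # Start from empty outputs so a rerun does not append twice.
    : > "${sample}_1.fastq"
    : > "${sample}_2.fastq"

    # Same run order for both mates keeps the pairs aligned.
    for run in "$@"; do
        cat "${run}_1.fastq" >> "${sample}_1.fastq"
        cat "${run}_2.fastq" >> "${sample}_2.fastq"
    done
}
```

After defining the function, something like merge_runs sampleA SRR7774397 SRR7774398 would write sampleA_1.fastq and sampleA_2.fastq (sampleA is a placeholder for the real sample label). If the files are gzipped, plain cat still works as in the answer above, because concatenated gzip files form a valid gzip file.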