Question

Retrieving paired-end sequencing data with fastq-dump

0


2.6 years ago

v.berriosfarias ▴ 140

Hello, I need to retrieve 42 FASTQ files from NCBI SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=13674977

I retrieved the SRA accession numbers for the sequencing data and saved them to a file called "SraAccList.txt". The paper's methodology mentions that they worked with paired-end sequencing data, so I did the following to retrieve the FASTQ files:

# read one accession per line and download the corresponding .sra file
while read -r accs
do
    prefetch "$accs"
done < SraAccList.txt

Then, for the retrieved .sra files, I used fastq-dump to get the paired-end reads:

for f in *.sra
do
    fastq-dump --split-3 "$f"
done

But I only got SRR{numbers}.fastq files, not paired-end read files.
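
A quick way to see what `--split-3` produced for each accession is sketched below. It assumes the usual naming, where the two mates go to `<acc>_1.fastq` and `<acc>_2.fastq` and reads without a mate go to `<acc>.fastq`. The function name `classify_split3_output` is made up for illustration.

```shell
#!/usr/bin/env bash
# Classify the fastq-dump --split-3 output for one accession.
# Usage: classify_split3_output ACCESSION [DIRECTORY]
classify_split3_output() {
    local acc=$1
    local dir=${2:-.}
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        # both mate files exist: the run was dumped as paired-end
        echo "$acc paired"
    elif [ -f "$dir/${acc}.fastq" ]; then
        # only the unsuffixed file exists: the run holds one read per spot
        echo "$acc single"
    else
        echo "$acc missing"
    fi
}
```

Running it over every line of SraAccList.txt (for example with `while read -r acc; do classify_split3_output "$acc"; done < SraAccList.txt`) gives a one-line summary per run.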

Other similar threads discuss cases where the submitters did not provide the full FASTQ data, but I'm not sure whether that applies here, or whether the retrieved FASTQ files are in interleaved format or are just single-end data.
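
One way to rule out the interleaved case is to check whether consecutive records share a read name. The sketch below assumes standard 4-line FASTQ records and that mates, if present, carry the same name apart from an optional trailing `/1` or `/2`. The function name `check_interleaved` is hypothetical, and only the first 1000 records are inspected.

```shell
#!/usr/bin/env bash
# Print "interleaved" if every consecutive pair of records in the first
# 1000 records shares a read name, otherwise print "not interleaved".
# Usage: check_interleaved FILE.fastq
check_interleaved() {
    head -n 4000 "$1" | awk '
        NR % 4 == 1 {                    # header line of each 4-line record
            name = $1                    # keep only the part before the first space
            sub(/^@/, "", name)          # drop the leading @
            sub(/\/[12]$/, "", name)     # drop a trailing /1 or /2 if present
            n++
            if (n % 2 == 0 && name == prev) pairs++
            prev = name
        }
        END {
            if (n >= 2 && pairs == int(n / 2)) print "interleaved"
            else print "not interleaved"
        }'
}
```

If this reports "not interleaved" for a file, the file holds one read per name, which is consistent with single-end data or with a single mate of a pair.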

I looked at the run information pages of the 42 SRR accessions, and they are labeled as PAIRED sequencing data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10912829

But it seems they definitely only provided single-end data:

[image: SRR info]

I compared these runs to a well-documented one, where the submitters provide both mates, as shown by the green bars:

[image: SRR info2]

So it seems the submitters didn't provide the complete sequencing data. Is this correct?

NCBI fastq sra-toolkit • 1.2k views

updated 2.6 years ago by Istvan Albert 102k • written 2.6 years ago by v.berriosfarias ▴ 140

1


2.6 years ago

Istvan Albert 102k

I suspect they have mislabeled their data as paired.

 bio search SRR10912829

prints:

[
    {
        "run_accession": "SRR10912829",
        "sample_accession": "SAMN13674977",
        "first_public": "2020-01-20",
        "country": "Antarctica",
        "sample_alias": "Antarctic Polar",
        "fastq_bytes": "4133348227",
        "read_count": "52569045",
        "library_name": "S08-1",
        "library_strategy": "WGS",
        "library_source": "METAGENOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2500",
        "study_title": "The polar microbiota Metagenome",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/029/SRR10912829/SRR10912829.fastq.gz"
    }
]

Note how only a single FASTQ file is listed under fastq_ftp.
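
The same check can be scripted across all 42 runs. The sketch below is an assumption-laden illustration rather than what `bio search` does internally: it assumes the ENA Portal API `filereport` endpoint with the URL and field names shown, and that `fastq_ftp` lists multiple files separated by semicolons. The function names `ena_filereport` and `count_fastq_files` are hypothetical.

```shell
#!/usr/bin/env bash
# Fetch run metadata for one accession as TSV (requires curl and network access).
# The endpoint and field names are assumptions based on the ENA Portal API.
ena_filereport() {
    curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=$1&result=read_run&fields=run_accession,library_layout,library_name,fastq_ftp&format=tsv"
}

# Count how many files a fastq_ftp value lists (semicolon-separated).
count_fastq_files() {
    local field=$1
    if [ -z "$field" ]; then
        echo 0
    else
        echo "$field" | awk -F';' '{ print NF }'
    fi
}
```

A run labeled PAIRED whose `fastq_ftp` value contains only one file, as in the output above, is a candidate for having been mislabeled or split across runs.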


0


I didn't know about bio search. Really useful.

The submitters uploaded each file separately, one per SRR accession. Thank you!

2.6 years ago by v.berriosfarias ▴ 140

score 2 · Accepted Answer · 2022-05-01

2


2.6 years ago

GenoMax 147k

If this is indeed paired-end data, as described in the paper, then it is unfortunate that the submitters appear to have submitted the individual paired-end data files as separate runs (LINK for Run Browser). You can confirm this by comparing the library names (as marked below) and checking whether you can relate them to the publication.

screenshot
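
The library-name comparison can also be done on a table of run metadata. The sketch below assumes a tab-separated file with `run_accession` in the first column and `library_name` in the second, and it assumes that both mates of a library were deposited under exactly the same library name. If the names differ by a suffix instead, the grouping key would need adjusting. The function name `group_runs_by_library` and the input file are hypothetical.

```shell
#!/usr/bin/env bash
# Group run accessions by library name.
# Input:  TSV with columns run_accession, library_name (header row optional)
# Output: library_name <TAB> comma-separated run accessions, sorted by library
group_runs_by_library() {
    awk -F'\t' '
        NR == 1 && $1 == "run_accession" { next }    # skip a header row
        { runs[$2] = ($2 in runs) ? runs[$2] "," $1 : $1 }
        END { for (lib in runs) print lib "\t" runs[lib] }' "$1" | sort
}
```

Libraries that come back with exactly two runs are the candidates for being the two mates of one paired-end library. Whether they really are mates still has to be checked against the publication and the read names.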


0


I hadn't seen the Run Browser! They definitely uploaded each file separately, one per SRR accession. Thank you so much.