Retrieving paired-end sequencing data with fastq-dump
2.6 years ago

Hello, I need to retrieve 42 fastq files from NCBI SRA: https://www.ncbi.nlm.nih.gov/sra?LinkName=biosample_sra&from_uid=13674977

I retrieved the SRA accession numbers for the sequencing data and saved them to a file called "SraAccList.txt". The paper's methodology says they worked with paired-end sequencing data, so I did the following to retrieve the fastq files:

# Download the .sra file for each accession listed in SraAccList.txt
for accs in $(cat SraAccList.txt)
do
    prefetch "$accs"
done

Then, for the retrieved .sra files, I used fastq-dump to get the paired-end reads:

# --split-3 writes _1/_2 files for paired reads and a plain .fastq for unpaired reads
for f in *.sra
do
    fastq-dump --split-3 "$f"
done

But I only got SRR{numbers}.fastq files and no paired-end read files.

Other similar threads discuss cases where the submitters did not provide the full fastq data, but I'm not sure whether that applies here, or whether the retrieved fastq files are in interleaved format or are just single-end data.
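One quick way to tell these cases apart is to dump only the first spot with its reads split apart and count how many FASTQ records come out. The sketch below is a minimal helper for that count; the function name `count_fastq_records` is made up for illustration, and it assumes the usual four-lines-per-record FASTQ layout.

```shell
# Count FASTQ records arriving on stdin.
# Assumption: each record is exactly 4 lines (header, sequence, '+', qualities).
count_fastq_records() {
    awk 'END { print int(NR / 4) }'
}

# Possible usage (needs sra-tools and a downloaded run, so it is left as a comment):
#   fastq-dump -X 1 -Z --split-spot SRR10912829.sra | count_fastq_records
# -X 1 limits the dump to the first spot, -Z writes to stdout,
# --split-spot splits the spot into its individual reads.
```

A count of 1 for the first spot would suggest the run holds a single read per spot, while a count of 2 would suggest both mates are stored in the run.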

I took a look at the run information page of the 42 SRR accessions, and they are labeled as PAIRED sequencing data: https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR10912829

But it seems they definitely provided only single-end data: SRR info

I compared these submitted SRRs to a well-published one, where the submitters provide both reads of each pair, as shown by the green bars:

SRR info2

So it seems the submitters didn't provide the complete sequencing data. Is this correct?

NCBI fastq sra-toolkit • 1.2k views
2.6 years ago
GenoMax 147k

If this is indeed paired-end data, as described in the paper, then it is unfortunate that the submitters appear to have submitted the individual files of each pair as separate runs (LINK for Run Browser). You can confirm that by comparing the library names (as marked below) and checking whether you can relate them to the publication.

screenshot


I hadn't seen the Run Browser! They definitely uploaded each file separately, one per SRR accession. Thank you so much.


Nice observation, GenoMax. In bioinformatics we have to expect the unexpected.

2.6 years ago

I suspect they have mislabeled their data as paired.

 bio search SRR10912829

prints:

[
    {
        "run_accession": "SRR10912829",
        "sample_accession": "SAMN13674977",
        "first_public": "2020-01-20",
        "country": "Antarctica",
        "sample_alias": "Antarctic Polar",
        "fastq_bytes": "4133348227",
        "read_count": "52569045",
        "library_name": "S08-1",
        "library_strategy": "WGS",
        "library_source": "METAGENOMIC",
        "library_layout": "PAIRED",
        "instrument_platform": "ILLUMINA",
        "instrument_model": "Illumina HiSeq 2500",
        "study_title": "The polar microbiota Metagenome",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR109/029/SRR10912829/SRR10912829.fastq.gz"
    }
]

Note how only a single FASTQ file is provided.
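The same check can be applied to many runs without reading each record by eye. The sketch below counts the files listed in the `fastq_ftp` field of one JSON record read from stdin. It assumes the field sits on a single line, as in the output above, and that multiple files are separated by semicolons, which is how ENA-style reports usually list the two files of a paired run; the function name `count_fastq_urls` is made up for illustration.

```shell
# Count the FASTQ files listed in the "fastq_ftp" field of one JSON record on stdin.
# Assumptions: the field is on one line, and multiple URLs are separated by ';'.
count_fastq_urls() {
    sed -n 's/.*"fastq_ftp": *"\([^"]*\)".*/\1/p' \
        | tr ';' '\n' \
        | grep -c . || true   # grep exits non-zero on a count of 0; still print the 0
}

# Possible usage (needs the bio package, so it is left as a comment):
#   bio search SRR10912829 | count_fastq_urls
```

A result of 1 matches the single-file case shown above; a properly paired submission would typically give 2.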


Didn't know about bio search. Really useful.

The submitters uploaded each file separately, one per SRR accession. Thank you!
