Question

ncbi error report log for validate fastq issue

0

14 months ago

1769mkc ★ 1.2k

I'm trying to fetch data for a list of GSM IDs. I checked with the sra-explorer tool that the data are present in the project, but when I try to download them through a script it fails, even after a number of retries.

I'm attaching the generated error log here; I would like to know what exactly is failing.

ncbi_error_report.txt

<Report>
 <Run>
  <Date>
   <Start value="Wed Sep 27 2023 6:21:28 AM"/>
   <End value="Wed Sep 27 2023 6:21:44 AM"/>
  </Date>
  <Home name="HOME" value="/root"/>
  <Cwd>/tmp</Cwd>
  <CommandLine argc="6">
   <Arg index="0" value="fastq-dump"/>
   <Arg index="1" value="-X"/>
   <Arg index="2" value="1"/>
   <Arg index="3" value="-Z"/>
   <Arg index="4" value="--split-spot"/>
   <Arg index="5" value="GSM2683458"/>
  </CommandLine>
  <Result rc="RC(rcVFS,rcMgr,rcOpening,rcDirectory,rcNotFound)"/>
  <User admin="true"/>
 </Run>
 <Configuration>
  <Files count="2">
   <File name="/etc/ncbi/settings.kfg"/>
   <File name="/root/.ncbi/user-settings.mkfg"/>
  </Files>
  <refseq state="not found"/>
  <krypto state="pwfile: not found"/>
  <sra>
   <quality_type>raw_scores</quality_type>
  </sra>
  <Config>
  <ConfigurationFiles>
/etc/ncbi/settings.kfg
/root/.ncbi/user-settings.mkfg
    </ConfigurationFiles>
    <APPNAME>"fastq-dump"</APPNAME>
    <APPPATH>"/tmp/"</APPPATH>
    <BUILD>"RELEASE"</BUILD>
    <HOME>"/root"</HOME>
    <HOST></HOST>
    <LIBS>
      <GUID>"119c217a-7b81-47e8-91d6-56d19c8c9f15"</GUID>
      <IMAGE_GUID>"119c217a-7b81-47e8-91d6-62229c64ee59"</IMAGE_GUID>
    </LIBS>
    <NCBI_HOME>"/root/.ncbi"</NCBI_HOME>
    <NCBI_SETTINGS>"/root/.ncbi/user-settings.mkfg"</NCBI_SETTINGS>
    <OS>"linux"</OS>
    <PWD>"/tmp"</PWD>
    <USER></USER>
    <VDB_CONFIG></VDB_CONFIG>
    <VDB_ROOT></VDB_ROOT>
    <kfg>
      <arch>
        <bits>"64"</bits>
        <name>"56d19c8c9f15"</name>
      </arch>
      <dir>"/root/.ncbi"</dir>
      <name>"user-settings.mkfg"</name>
    </kfg>
    <libs>
      <cloud>
        <report_instance_identity>"true"</report_instance_identity>
      </cloud>
    </libs>
    <repository>
      <user>
        <ad>
          <public>
            <apps>
              <file>
                <volumes>
                  <flat></flat>
                  <flatAd>"."</flatAd>
                </volumes>
              </file>
              <refseq>
                <volumes>
                  <refseqAd>"."</refseqAd>
                </volumes>
              </refseq>
              <sra>
                <volumes>
                  <sraAd>"."</sraAd>
                </volumes>
              </sra>
              <sraPileup>
                <volumes>
                  <ad>"."</ad>
                </volumes>
              </sraPileup>
              <sraRealign>
                <volumes>
                  <ad>"."</ad>
                </volumes>
              </sraRealign>
              <wgs>
                <volumes>
                  <wgsAd>"."</wgsAd>
                </volumes>
              </wgs>
            </apps>
            <root>"."</root>
          </public>
        </ad>
      </user>
    </repository>
    <sra>
      <quality_type>"raw_scores"</quality_type>
    </sra>
    <vdb>
      <lib>
        <paths>
          <kfg>"/usr/local/bin"</kfg>
        </paths>
      </lib>
    </vdb>
  </Config>
  <RemoteAccess available="false"/>
  <CurrentProtectedRepository found="false"/>
 </Configuration>
 <Object path="https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR5755657/SRR5755657" type="database" fs_type="unexpected">
  <Dependencies>
   <List count="22" missing="22">
    <Dependency index="0" seq_id="NC_000067.5" local="false" path=""/>
    <Dependency index="1" seq_id="NC_000068.6" local="false" path=""/>
    <Dependency index="2" seq_id="NC_000069.5" local="false" path=""/>
    <Dependency index="3" seq_id="NC_000070.5" local="false" path=""/>
    <Dependency index="4" seq_id="NC_000071.5" local="false" path=""/>
    <Dependency index="5" seq_id="NC_000072.5" local="false" path=""/>
    <Dependency index="6" seq_id="NC_000073.5" local="false" path=""/>
    <Dependency index="7" seq_id="NC_000074.5" local="false" path=""/>
    <Dependency index="8" seq_id="NC_000075.5" local="false" path=""/>
    <Dependency index="9" seq_id="NC_000076.5" local="false" path=""/>
    <Dependency index="10" seq_id="NC_000077.5" local="false" path=""/>
    <Dependency index="11" seq_id="NC_000078.5" local="false" path=""/>
    <Dependency index="12" seq_id="NC_000079.5" local="false" path=""/>
    <Dependency index="13" seq_id="NC_000080.5" local="false" path=""/>
    <Dependency index="14" seq_id="NC_000081.5" local="false" path=""/>
    <Dependency index="15" seq_id="NC_000082.5" local="false" path=""/>
    <Dependency index="16" seq_id="NC_000083.5" local="false" path=""/>
    <Dependency index="17" seq_id="NC_000084.5" local="false" path=""/>
    <Dependency index="18" seq_id="NC_000085.5" local="false" path=""/>
    <Dependency index="19" seq_id="NC_000086.6" local="false" path=""/>
    <Dependency index="20" seq_id="NC_000087.6" local="false" path=""/>
    <Dependency index="21" seq_id="NC_005089.1" local="false" path=""/>
   </List>
  </Dependencies>
 </Object>
 <SOFTWARE>
  <VDBLibrary vers="2.7.47"/>
  <Build static="true">
   <Module name=""/>
  </Build>
  <Tool date="Nov 18 2022" name="fastq-dump" vers="3.0.1">
   <Binary path="/usr/local/bin/fastq-dump" type="alias" md5="c461c39bfa514aff3c4f7c0416ced617">
    <Alias resolved="fastq-dump.3">
     <Alias resolved="fastq-dump.3.0.1">
      <Alias resolved="sratools.3.0.1"/>
     </Alias>
    </Alias>
   </Binary>
  </Tool>
 </SOFTWARE>
 <Env>
 </Env>
</Report>

Any suggestion or help would be really appreciated

sra-tools • 2.3k views

ADD COMMENT • link updated 14 months ago by GenoMax 147k • written 14 months ago by 1769mkc ★ 1.2k

1

I don't think this log is helpful. Can't you just get fastq download links via sra-explorer.info or the tool mentioned by Rob in his answer here Fetch Fastq files directly for SRA data ?

Avoid SRA toolkit at all costs, it's a mess. If you're forced to use it then use prefetch to download sra files first and then use fastq-dump locally to convert the sra to fastq. Never fetch via fastq-dump directly, it's super picky and error-prone as you're experiencing.
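A minimal sketch of that two-step workflow, assuming sra-tools 3.x with `prefetch` and `fastq-dump` on the PATH. The `run` helper, the `DRY_RUN` variable and the `fetch_then_dump` name are made up for illustration and are not part of sra-tools; the assumption that `prefetch` stores the run under `OUTDIR/ACCESSION/` should be checked against your version.

```shell
# Run a command, or only print it when DRY_RUN is set (illustrative helper).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fetch_then_dump ACCESSION [OUTDIR]
# Step 1 downloads the run archive, step 2 converts the local copy.
fetch_then_dump() {
    acc="$1"
    outdir="${2:-.}"
    # Assumption: prefetch places the run in OUTDIR/ACCESSION/.
    run prefetch --output-directory "$outdir" "$acc" || return 1
    # --split-files writes one FASTQ file per read of a spot.
    run fastq-dump --split-files --outdir "$outdir" "$outdir/$acc"
}
```

Setting `DRY_RUN=1` before calling `fetch_then_dump SRR5755657 data` only prints the two commands, which is handy for checking them before running inside a container.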

ADD REPLY • link 14 months ago by ATpoint 85k

0

"you're forced to use it": this is more or less my situation, since I have to use a Docker image and pass the GSM IDs as a list of inputs, first to check whether there are valid data files, and only then go on to the next step of making the FASTQ files. Strangely, this works for some project samples without any issue, and for others it doesn't work at all, even though I added a few retries.

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

2

Can you provide details of what you are doing and the commands you are using?

ADD REPLY • link 14 months ago by GenoMax 147k

0

I will share the shell script that is part of the pipeline. It calls the Docker image that contains the NCBI SRA toolkit, with a list of GSM IDs as input.

my code

#!/bin/sh

set -x
PS4='[\\d \\t] '

# Check parameter for error
check=0
# Print fastq-dump executable path
command -v fastq-dump

# Function to download FastQ with retries
download_with_retry() {
    local id="$1"
    local retries=3
    for attempt in $(seq "$retries"); do
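        # Download only the first spot as an accessibility check.
        # get_ngc is not defined in this snippet; it is assumed to be provided elsewhere in the pipeline.
        fastq-dump $(get_ngc) -X 1 -Z --split-spot "$id" > "${id}.test.fastq" 2> "${id}.test.log"
        numLines=$(wc -l < "${id}.test.fastq")
        if [ "$numLines" -gt 0 ]; then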
            echo "${id} has data... OK"
            return 0
        else
            echo "${id} does not have data on attempt $attempt... Retrying in 5 seconds..."
            sleep 5
        fi
    done
    echo "${id} could not be downloaded after $retries attempts... ERROR"
    check=1
    return 1
}

# Loop through all parameters to check validity
for file in "$@"; do
    cp "${file}" .
    # Extract filename for sampleID
    file_basename=$(basename "${file}")
    id="${file_basename%".id"}"
    # Start validation with retries
    echo "Checking ${id}..."
    download_with_retry "$id"
done

# Exit with error if some FastQs are not accessible
if [ $check -gt 0 ]; then
    echo "ERROR: One or more samples have inaccessible FastQs.. exiting"
    exit 1
fi

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

2

Can you also provide an example of a GSM ID that fails? I was not aware that you could use GSM IDs directly with fastq-dump. I would think that you would need to get the SRA accessions for the GSM ID first. sra-explorer is doing that conversion and perhaps that is why GSM IDs work there.

ADD REPLY • link 14 months ago by GenoMax 147k

0

GSM2683458 is the test case that fails, and GSM3603268 is one that works fine.

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

1

It looks like if you use the GSM IDs directly with fastq-dump you end up with the following error (repeated 3x), though the retrieval seems to work.

fastq-dump.3.0.1 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 's3'

Mapping the GSM ID to the SRR accession first does not generate this error. The files (I only recovered a couple of reads) obtained by both methods appear to be identical.

$  esearch -db sra -query GSM2683458 | efetch -format runinfo
    Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
    SRR5755657,2018-01-02 08:58:08,2017-06-26 10:56:24,32468069,3299947248,0,101,620,GCA_000001635.1,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-14/SRR005/755/SRR5755657.sralite.1,SRX2955837,,ATAC-seq,other,GENOMIC,SINGLE,0,0,ILLUMINA,Illumina HiSeq 2500,SRP110503,PRJNA391904,2,391904,SRS2313068,SAMN07277232,simple,10090,Mus musculus,GSM2683458,,,,,,,no,,,,,GEO,SRA580888,,public,447A3DC22B1E9DF1BE6993EE6ACD98E4,0D2722F0763A64F8D93CEC360B1DBDAC

fastq-dump SRR5755657 works without errors.

ADD REPLY • link 14 months ago by GenoMax 147k

0

So how do I map on the fly when I have GSM IDs as input?

I tried this to check:

fastq-dump -X 1 -Z --split-spot SRR5755657 > SRR5755657.test.fastq 2> SRR5755657.test.log

cat SRR5755657.test.log
2023-09-27T12:35:18 fastq-dump.3.0.1 warn: directory not found while opening manager within virtual file system module - can't open NC_000076.5 as a RefSeq or as a WGS
2023-09-27T12:35:18 fastq-dump.3.0.1 err: directory not found while opening manager within virtual file system module - failed SRR5755657

=============================================================
An error occurred during processing.
A report was generated into the file '/root/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to 'sra-tools@ncbi.nlm.nih.gov' for assistance.
=============================================================

fastq-dump quit with error code 3

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

1

Did you run vdb-config -i to set up a temp directory for use with sratoolkit? Error above is for that.

so how do I map on the go when I have GSM id as input?

You can map GSM IDs to SRA accessions using EntrezDirect (LINK). Using GSM IDs directly may be tricky since some IDs may map to more than one SRA ID.
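As a rough sketch of that mapping, the `esearch | efetch -format runinfo` pipeline shown earlier in this thread can be wrapped in a small function. It assumes EDirect is installed and the network is reachable; `runs_from_runinfo` and `gsm_to_runs` are hypothetical helper names, and the filter simply keeps values in the first CSV column that look like run accessions.

```shell
# Print the run accessions found in the first column of runinfo CSV on stdin.
# Header lines ("Run,...") and blank lines do not match the pattern.
runs_from_runinfo() {
    awk -F, '$1 ~ /^[SED]RR[0-9]+$/ { print $1 }'
}

# gsm_to_runs GSM_ID
# Needs EDirect (esearch, efetch) and network access.
gsm_to_runs() {
    esearch -db sra -query "$1" | efetch -format runinfo | runs_from_runinfo
}
```

A loop such as `for run in $(gsm_to_runs GSM2683458); do fastq-dump -X 1 -Z --split-spot "$run"; done` would then test each run separately. A GSM ID that maps to several runs yields one accession per line.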

ADD REPLY • link 14 months ago by GenoMax 147k

3

NCBI is the problem because the people making decisions there lack the minimal common sense and understanding of the problems they are trying to solve

when someone needs to download a simple file, they shouldn't need to run config this or config that,

they shouldn't need to install some obtuse, buggy, overcomplicated, and inefficient tool like fastq-dump

fastq-dump and the way SRA works demonstrate the disconnect and complete lack of accountability at the highest levels - and it has been like this for perhaps two decades - all along it has been and continues to be a bottleneck to science

The choices made by NCBI are the problem

ADD REPLY • link 14 months ago by Istvan Albert 101k

0

https://hub.docker.com/r/ncbi/sra-tools is the image I'm using. Regarding "Did you run vdb-config -i to set up a temp directory for use with sratoolkit?": I did that on a standalone system, where it brings up the GUI and the settings can be seen, but in the case of an image, what am I supposed to configure, and how?

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

2

Have you seen: https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker

You could try the solution mentioned here: https://github.com/ncbi/sra-tools/issues/630

ADD REPLY • link 14 months ago by GenoMax 147k

1

I got it working after I updated the SRA Lite option using vdb-config -i.

The output I now see looks like this:

docker run -t --rm -v $PWD:/output:rw -w /output kcm1400/validate_fastq_ncbi_sra:v2 fastq-dump -X 1 -Z --split-spot GSM2683458
2023-09-27T18:40:14 fastq-dump.3.0.1 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 'ncbi'
2023-09-27T18:40:14 fastq-dump.3.0.1 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 'ncbi'
2023-09-27T18:40:14 fastq-dump.3.0.1 int: string unexpected while executing query within virtual file system module - multiple response SRR URLs for the same service 'ncbi'
Read 1 spots for GSM2683458
Written 1 spots for GSM2683458
@GSM2683458.1 1 length=51
ACCTTAAAGAATTGGCTTTTTAAAACAAAAGAGGGGCAGCTATTTCTGTCT
+GSM2683458.1 1 length=51
???????????????????????????????????????????????????
@GSM2683458.1 1 length=51
AAATAGCTGCCCCTCTTTTGTTTTAAAAAGCCAATTCTTTAAGGTCTGTCT
+GSM2683458.1 1 length=51
???????????????????????????????????????????????????

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

2

You may want to use --split-files instead. With your --split-spot option it looks like you end up with interleaved data files. Unless you are dealing with that internally, it is safer to get regular R1/R2 files.
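If the interleaved stream does have to be handled downstream, one rough way to separate it is sketched here. It assumes every spot contributes exactly two consecutive four-line FASTQ records, which will not hold if some spots have a single read; `deinterleave_fastq` and the output file names are made up for illustration.

```shell
# deinterleave_fastq R1_FILE R2_FILE
# Reads interleaved FASTQ on stdin; odd-numbered records go to R1_FILE,
# even-numbered records go to R2_FILE.
deinterleave_fastq() {
    awk -v r1="$1" -v r2="$2" '
        {
            # Records are 4 lines long; record index starts at 0.
            rec = int((NR - 1) / 4)
            if (rec % 2 == 0) print > r1
            else              print > r2
        }
    '
}
```

For example, `fastq-dump -Z --split-spot SRR8571942 | deinterleave_fastq R1.fastq R2.fastq` would write the first read of each spot to `R1.fastq` and the second to `R2.fastq`, under the assumption stated above.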

ADD REPLY • link 14 months ago by GenoMax 147k

0

Thank you for the resource. I will look at it, try to adopt the fix, and see how it works.

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k

0

When I ran this for the successful GSM ID, GSM3603268, using its respective SRA run accession, I see this output, which was not the case for the one above:

fastq-dump -X 1 -Z --split-spot SRR8571942
Read 1 spots for SRR8571942
Written 1 spots for SRR8571942
@SRR8571942.1 1 length=100
TTTTTTAAAGAAAACTTGAGCTTTTGGAGCATGGCAACCTAGCCTGCAGACACCGTATCCCCTTGTCCACTTCCCCCTGCAAACCATAAGTCCATTCCTA
+SRR8571942.1 1 length=100
=@@DDDDA++2CDABCEG@GEHHEHGA9EDFHGHD3::B??<DBFFBFB===)5;CA;(7=?;?CE;7)7;>C((583;<9>(,89@@@@CCCCDD:BC:
@SRR8571942.1 1 length=100
AGTTCCACGAGTTTTCTTTTTTTTAAGTGGTAGGAATGGACTTATGGGTTGCAGGGGGAAGTGGACAAGGGGCTACGGTGTCTGCAGGCTTGGTTGCCAT
+SRR8571942.1 1 length=100
:?=BD:D?C)<CDGH4+<CCCGHI)??D<D?00?D*?39C<CG4BFE28;4(.7AB############################################

ADD REPLY • link 14 months ago by 1769mkc ★ 1.2k