Question

Problem with data downloaded from the Sequence Read Archive (SRA)

0

7 months ago

Begonia_pavonina ▴ 200

I downloaded reads from the SRA with the script below in order to process them with the metabarcoding analysis pipeline DADA2. Unfortunately, I get the following error when running them through the pipeline:

Error: BiocParallel errors
  0 remote errors, element index:
  156 unevaluated and other errors
  first remote error:
Execution halted

I know that some of the read files are not problematic, as I can process at least three of them without error. A larger number do not work, however, and there is apparently no way to distinguish the files that work from the others. The fastq files all have the expected size and do not seem to be corrupted.

Has anyone run into this problem? Is there anything I should know about SRA data?

# Perform the search and retrieve metadata
esearch -db sra -query "PRJNA997374[All Fields] AND rbcl[Title]" | efetch -format docsum > sra_results.xml

# Extract SRA accession numbers from the XML output
grep '<Sample acc=' sra_results.xml | sed 's/.*acc="\([^"]*\)".*/\1/' > list_sra.txt

# Get the data in .sra format:
prefetch *.sra

# Specify the file containing SRA accession numbers
input_file="list_sra.txt"

# Loop through each accession number in the input file
while IFS= read -r accession_number
do
    # Run prefetch to download the SRA data
    prefetch "$accession_number"
done < "$input_file"

# Import files in working directory
ls -d SRR* > directories.txt

while IFS= read -r f
do
    cp "$f"/*.sra ./
done < directories.txt

# convert all the files in frw and rev fasta formats:
fastq-dump --split-files *.sra

SRA DADA2 metabarcoding • 660 views

ADD COMMENT • link updated 7 months ago by atharvakarkare14 ▴ 40 • written 7 months ago by Begonia_pavonina ▴ 200

0

One cannot help you with this. The error is from R, yet you show not a single line of R code.

ADD REPLY • link 7 months ago by ATpoint 86k

0

Thank you ATpoint. I have already looked into this error on the DADA2 GitHub. It seems that the "BiocParallel" error can have multiple causes, and since I have not modified the pipeline code, it may not be relevant to show it here. However, as explained in my post, I am confident that some of the reads are the cause of the issue. That is why I show the script used to import the reads: I thought the error was most likely there.

ADD REPLY • link 7 months ago by Begonia_pavonina ▴ 200

2

This fastq-dump --split-files *.sra gives you FASTQ files. I am not sure how one might debug your problem with the information given. As a low-level validation, you can run FastQC on the data and see whether it throws any errors. If it does not, the files are probably not corrupted.
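Alongside FastQC, a quick structural check can be sketched in plain shell. This is only a minimal sketch under assumptions: the function names `count_fastq_reads` and `check_pair` are hypothetical, the files are assumed to be uncompressed FASTQ with four lines per record, and paired files are assumed to follow the `_1.fastq` / `_2.fastq` naming that `fastq-dump --split-files` produces.

```shell
# Hypothetical helper: print the number of reads in an uncompressed FASTQ file,
# or fail if the line count is not a multiple of 4 (assumes 4-line records).
count_fastq_reads() {
    local file="$1" lines
    lines=$(wc -l < "$file") || return 1
    if [ $((lines % 4)) -ne 0 ]; then
        echo "ERROR: $file has $lines lines (not a multiple of 4)" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Hypothetical helper: check that a forward and a reverse file
# contain the same number of reads.
check_pair() {
    local fwd="$1" rev="$2" n_fwd n_rev
    n_fwd=$(count_fastq_reads "$fwd") || return 1
    n_rev=$(count_fastq_reads "$rev") || return 1
    if [ "$n_fwd" -ne "$n_rev" ]; then
        echo "MISMATCH: $fwd ($n_fwd reads) vs $rev ($n_rev reads)" >&2
        return 1
    fi
    echo "OK: $fwd / $rev ($n_fwd reads each)"
}

# Example usage over all pairs in the current directory:
# for fwd in *_1.fastq; do
#     check_pair "$fwd" "${fwd%_1.fastq}_2.fastq"
# done
```

A file that passes this check can still contain problematic reads; it only rules out truncated files and unequal pairs, and it reports empty files as having 0 reads.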

ADD REPLY • link 7 months ago by ATpoint 86k

2

Use vdb-validate, included in the SRA Toolkit, to check the integrity of your *.sra files.
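One way to apply this to many files at once could look like the sketch below. The function name `validate_sra_files` and the `.validate.log` file naming are hypothetical, and the sketch assumes that `vdb-validate` is on the PATH and exits with a non-zero status when a file fails validation.

```shell
# Hypothetical helper: run vdb-validate on every file given as an argument,
# keep its output in a log next to the file, and report which files failed.
# Assumes vdb-validate (SRA Toolkit) returns non-zero on a failed check.
validate_sra_files() {
    local f failed=0
    for f in "$@"; do
        if vdb-validate "$f" > "${f}.validate.log" 2>&1; then
            echo "ok      $f"
        else
            echo "FAILED  $f (see ${f}.validate.log)"
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file passed
    [ "$failed" -eq 0 ]
}

# Example usage:
# validate_sra_files *.sra
```

Any file reported as FAILED is a candidate for downloading again with prefetch before converting it to FASTQ.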

ADD REPLY • link 7 months ago by GenoMax 148k

score 1 · Answer 1 · 2024-05-02

1

7 months ago

atharvakarkare14 ▴ 40

Download the files with the SRA Toolkit prefetch command, then convert them with fasterq-dump, which reports the number of reads. For validation, no tool exists other than vdb-validate.
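The prefetch and fasterq-dump steps could be combined per accession roughly as follows. This is a sketch, not a definitive workflow: the function names `fetch_run` and `fetch_all`, the default output directory `fastq`, and the per-run log files are hypothetical, and both tools are assumed to be on the PATH.

```shell
# Hypothetical helper: download one run with prefetch, then convert it
# with fasterq-dump. The fasterq-dump output (including its read-count
# summary) is saved to a per-run log and printed.
fetch_run() {
    local acc="$1" outdir="${2:-fastq}"
    mkdir -p "$outdir"
    prefetch "$acc" || { echo "prefetch failed: $acc" >&2; return 1; }
    if ! fasterq-dump --split-files --outdir "$outdir" "$acc" \
            > "$outdir/$acc.log" 2>&1; then
        echo "fasterq-dump failed: $acc (see $outdir/$acc.log)" >&2
        return 1
    fi
    cat "$outdir/$acc.log"
}

# Hypothetical helper: process every accession listed in a file, one per line.
fetch_all() {
    local list="$1" outdir="${2:-fastq}" acc failed=0
    while IFS= read -r acc; do
        [ -n "$acc" ] || continue          # skip empty lines
        # stdin is redirected so the tools cannot consume the accession list
        fetch_run "$acc" "$outdir" < /dev/null || failed=$((failed + 1))
    done < "$list"
    [ "$failed" -eq 0 ]
}

# Example usage with the accession list from the question:
# fetch_all list_sra.txt
```

Processing one accession at a time makes it easier to see which run fails, and the saved logs let the reported read counts be compared with the FASTQ files that DADA2 later reads.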

ADD COMMENT • link 7 months ago by atharvakarkare14 ▴ 40