Question

Extract SRA reads per RG using fasterq-dump

4.1 years ago

MAPK ★ 2.1k

I am trying to download dbGaP SRA samples. I have been using fastq-dump with the command below, but for this project it runs very slowly because the datasets are large. I would like to use fasterq-dump instead, but I can't figure out how to split the reads per read group (RG) tag. I tried fasterq-dump with the same options, but it looks like fasterq-dump doesn't have the defline options. Any suggestions?

This is the command I use with fastq-dump:

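# Download the accession; the large -X value raises prefetch's maximum download size.
prefetch --ngc /dbGaP/prj_222.ngc -X 9999999999999 ${SRR}
# Collect the @RG header lines, one array element per line.
IFS=$'\n'
RGLINES=($(sam-dump --ngc /dbGaP/prj_222.ngc ./${SRR} | sed -n '/^[^@]/!p;//q' | grep ^@RG))
# Build a tee command with one "grep | gzip" process substitution per read group.
args=(tee)
for RGLINE in ${RGLINES[@]}; do
    unset IFS
    RG=(${RGLINE})
    args+=(\>\(grep -A3 --no-group-separator \"\\.${RG[1]#ID:}/[12]$\" \| gzip \> "./${SRR}.${RG[1]#ID:}.fastq-dump.split.defline.z.tee.fq.gz"\))
done
args+=(\>/dev/null)
echo "Splitting ${SRR}.sra into ${#RGLINES[@]} ReadGroups"
# Stream all reads once; each grep keeps only the records of its own read group.
fastq-dump-orig.2.10.8 --ngc /dbGaP/prj_222.ngc --split-3 --defline-seq '@$ac.$si.$sg/$ri' --defline-qual '+' -Z "./${SRR}" | eval ${args[@]}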
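
The tee/grep fan-out above makes every read group's grep scan the whole stream. As a possible alternative, here is a minimal sketch that routes each four-line record in a single awk pass, using the read-group field of the defline. It assumes the `@$ac.$si.$sg/$ri` defline format from the command above, plain four-line FASTQ records on stdin, and an output prefix without shell-special characters. The function name `split_fastq_by_rg`, the `unassigned` label, and the output file naming are hypothetical, not part of sra-tools.

```shell
#!/usr/bin/env bash
# Route each 4-line FASTQ record on stdin to a per-read-group gzip file.
# Assumes deflines of the form @<accession>.<spot>.<read group>/<read index>.
split_fastq_by_rg() {
    local prefix=$1   # hypothetical output prefix, e.g. ./SRR000001
    awk -v prefix="$prefix" '
        NR % 4 == 1 {
            rg = $0
            sub(/^@[^.]*\.[^.]*\./, "", rg)    # drop "@accession.spot."
            sub(/\/[0-9]+$/, "", rg)           # drop the trailing "/1" or "/2"
            if (rg == "") rg = "unassigned"    # reads without a spot group
            gsub(/[^A-Za-z0-9._-]/, "_", rg)   # keep the file name shell-safe
            out = "gzip > \"" prefix "." rg ".fq.gz\""
        }
        # awk keeps one pipe open per distinct command string,
        # so all records of a read group go to the same gzip process.
        { print | out }
    '
}
```

It would be used in place of the `eval ${args[@]}` stage, for example `fastq-dump ... -Z "./${SRR}" | split_fastq_by_rg "./${SRR}"`. Because the splitter only depends on the defline, it should work with any dumper that can write that defline format to stdout; whether a given fasterq-dump build offers such an option is something to check in its `--help` output for your sra-tools version.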

SRA fasterq-dump dbGAP • 1.6k views

4.1 years ago by MAPK ★ 2.1k

You should first download the SRA file with prefetch and then run fastq-dump on that file. I am almost certain that fastq-dump alone will not manage to download large files without at least one connection error; prefetch is much more stable. See the last section of "Fast download of FASTQ files from the European Nucleotide Archive (ENA)".

4.1 years ago by ATpoint 85k

I actually downloaded the SRA accession with prefetch first and then passed it to fastq-dump as -Z "./${SRR}". I am not sure whether this is the correct way to use the downloaded SRA folder.

4.1 years ago by MAPK ★ 2.1k