Question

BWA-MEM2 and samtools using for loop

0

Entering edit mode

2.6 years ago

rodolfo.peacewalker ▴ 390

Hi!

I'm trying to align some paired-end fastq files derived from target sequencing using BWA-MEM2 and samtools to filter and sort mapped reads. To avoid the accumulation of bam files after every step, I figured out that a for loop using pipes would be useful. I share my code at the moment:

##Allign the reads with respect to the reference
for file in ../data/*.fastq.gz; do
SAMPLE=$(echo ${file} | sed -e 's/R1.fastq.gz//' | sed -e 's/R2.fastq.gz//')
R1="${SAMPLE}R1.fastq.gz"
R2="${SAMPLE}R2.fastq.gz"
SAMP=$(echo ${SAMPLE} | sed 's/.*data\///' | sed 's/\_clean_paired*//')
echo "Processing sample $SAMP"
bwa-mem2 mem -t 8 hsa_GRCh38 $R1 $R2 | samtools view -Sq >= 1 - | samtools sort - -o ../output/$SAMP.bam
done

I have checked with echo that the name of the R1, R2 and output files (SAMP) are correctly generated. However, after running the code I have the next error message from samtools:

Processing sample Q-0002-18_S10_
samtools sort: failed to read header from "-"

I have checked the code for a single sample and works; however, I wonder what is going on when implement the for loop.

Thanks in advance!

BWA-MEM2 bash samtools • 2.0k views

ADD COMMENT • link updated 2.6 years ago by rpolicastro 13k • written 2.6 years ago by rodolfo.peacewalker ▴ 390

2

Entering edit mode

Your for loop should look something like this. Note the below options are untested.

for fq in $(find ../data -name "*R1.fastq.gz")
do
  bwa-mem mem -t8 hsa_GRCh38 $fq ${fq/R1.fastq.gz/R2.fastq.gz} |
    samtools view -q1 - |
    samtools sort -@8 -O BAM -o ../output/$(basename $fq R1.fastq.gz).bam -
done

You can do all of this directly using find too.

find ../data/ -name "*R1.fastq.gz" -exec sh -c \
  'bwa-mem mem -t8 hsa_GRCh38 $0 ${0/R1.fastq.gz/R2.fastq.gz} | \
    samtools view -q1 - | \
    samtools sort -@8 -O BAM -o ../output/$(basename $0 R1.fastq.gz).bam -' {} \;

GNU parallel option since it's convenient.

parallel -kj1 --link \
  bwa-mem mem -t8 hsa_GRCh38 {1} {2} '|' \
    samtools view -q1 - '|' \
    samtools sort -@8 -O BAM -o ../output/'$(basename {1} R1.fastq.gz)'.bam - \
  ::: ../data/*R1.fastq.gz ::: ../data/*R2.fastq.gz

ADD REPLY • link 2.6 years ago by rpolicastro 13k

1

Entering edit mode

I see a few problems, not sure how your single sample run worked.

> is shell redirection. How does your samtools view command work at all? -S is ignored and -q takes an INT, >=1 is not a valid parameter to anything and should break your command.
You could do for f in ../data/*R1.fastq.gz instead of a more generic glob, and use ${f/R1.fastq.gz/} instead of your double sed.
More sed and regex points but they are not mission critical

Addendum to point 1:

You're running bwa-mem twice per sample because your loop runs per fastq file, not per fastq-pair. That's where your loop logic breaks.

ADD REPLY • link 2.6 years ago by Ram 44k