BWA-MEM2 and samtools using for loop
0
0
Entering edit mode
2.6 years ago

Hi!

I'm trying to align some paired-end fastq files derived from target sequencing using BWA-MEM2 and samtools to filter and sort mapped reads. To avoid the accumulation of bam files after every step, I figured out that a for loop using pipes would be useful. I share my code at the moment:

##Allign the reads with respect to the reference
for file in ../data/*.fastq.gz; do
SAMPLE=$(echo ${file} | sed -e 's/R1.fastq.gz//' | sed -e 's/R2.fastq.gz//')
R1="${SAMPLE}R1.fastq.gz"
R2="${SAMPLE}R2.fastq.gz"
SAMP=$(echo ${SAMPLE} | sed 's/.*data\///' | sed 's/\_clean_paired*//')
echo "Processing sample $SAMP"
bwa-mem2 mem -t 8 hsa_GRCh38 $R1 $R2 | samtools view -Sq >= 1 - | samtools sort - -o ../output/$SAMP.bam
done

I have checked with echo that the name of the R1, R2 and output files (SAMP) are correctly generated. However, after running the code I have the next error message from samtools:

Processing sample Q-0002-18_S10_
samtools sort: failed to read header from "-"

I have checked the code for a single sample and works; however, I wonder what is going on when implement the for loop.

Thanks in advance!

BWA-MEM2 bash samtools • 2.0k views
ADD COMMENT
2
Entering edit mode

Your for loop should look something like this. Note the below options are untested.

for fq in $(find ../data -name "*R1.fastq.gz")
do
  bwa-mem mem -t8 hsa_GRCh38 $fq ${fq/R1.fastq.gz/R2.fastq.gz} |
    samtools view -q1 - |
    samtools sort -@8 -O BAM -o ../output/$(basename $fq R1.fastq.gz).bam -
done

You can do all of this directly using find too.

find ../data/ -name "*R1.fastq.gz" -exec sh -c \
  'bwa-mem mem -t8 hsa_GRCh38 $0 ${0/R1.fastq.gz/R2.fastq.gz} | \
    samtools view -q1 - | \
    samtools sort -@8 -O BAM -o ../output/$(basename $0 R1.fastq.gz).bam -' {} \;

GNU parallel option since it's convenient.

parallel -kj1 --link \
  bwa-mem mem -t8 hsa_GRCh38 {1} {2} '|' \
    samtools view -q1 - '|' \
    samtools sort -@8 -O BAM -o ../output/'$(basename {1} R1.fastq.gz)'.bam - \
  ::: ../data/*R1.fastq.gz ::: ../data/*R2.fastq.gz
ADD REPLY
1
Entering edit mode

I see a few problems, not sure how your single sample run worked.

  1. > is shell redirection. How does your samtools view command work at all? -S is ignored and -q takes an INT, >=1 is not a valid parameter to anything and should break your command.
  2. You could do for f in ../data/*R1.fastq.gz instead of a more generic glob, and use ${f/R1.fastq.gz/} instead of your double sed.
  3. More sed and regex points but they are not mission critical

Addendum to point 1:

You're running bwa-mem twice per sample because your loop runs per fastq file, not per fastq-pair. That's where your loop logic breaks.

ADD REPLY

Login before adding your answer.

Traffic: 2246 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6