I thought about sharing this little script to:
- trim,
- align,
- sort,
- clip overlap
paired-end fastq files in a single stream, i.e. without writing intermediate files.
This is to save time and space on disk, maybe useful to others (but also to check I'm not doing anything wrong or weird...):
paste <(zcat reads_R1.fq.gz | paste - - - -) \
<(zcat reads_R2.fq.gz | paste - - - -) \
| tr '\t' '\n' \
| cutadapt --interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC - \
| bwa mem -p genome.fa - \
| samtools sort -o - -O bam -T aln.tmp.bam \
| bam clipOverlap --in -.bam --out -.bam > aln.bam &&
samtools index aln.bam
The main trick is to convert paired-end fastq files to interleaved format (paste
and tr
commands), then use cutadapt to read and write interleaved reads and bwa mem to align reads from stdin.
I deliberately left out any optional parameter to each program (I usually set -M in bwa mem
). Note that if the two reads in a pair are completely overlapping, clipOverlap will set the cigar string of one of the two reads to fully soft clipped (e.g. 75S
). Picard will complain in such cases unless you set the VALIDATION_STRINGENCY to LENIENT or SILENT.
This was with:
cutadapt Version 1.9
bwa Version: 0.7.12-r1039
samtools Version: 1.1 (using htslib 1.1)
bam clipOverlap Version: 1.0.12
check cutadapt doesn't remove the whole sequence from your reads.
Hi Pierre, what do you mean? If a read partially matches the adapter at 3'end, only the overlapping part is trimmed.
I see. I remember when I used cutadapt, I got some empty reads at the end (and bwa doesn't like this).
Yes, cutadapt can give empty reads if the whole read is trimmed, i.e. it is only adapter sequence. In this cases bwa mem seems to do the right thing by reporting the empty reads as unmapped and with no sequence (older versions might crash). Anyway, cutadapt has -m option to discard reads shorter than INT bases (default is 0) so empty reads never occur. I usually set -m to around 15 or above.