Tutorial:Trim & align paired-end reads in a single pass
1
9
Entering edit mode
8.4 years ago

I thought about sharing this little script to:

  • trim,
  • align,
  • sort,
  • clip overlap

paired-end fastq files in a single stream, i.e. without writing intermediate files.

This is to save time and space on disk, maybe useful to others (but also to check I'm not doing anything wrong or weird...):

paste <(zcat reads_R1.fq.gz | paste - - - -) \
      <(zcat reads_R2.fq.gz | paste - - - -) \
| tr '\t' '\n' \
| cutadapt --interleaved -a AGATCGGAAGAGC -A AGATCGGAAGAGC - \
| bwa mem -p genome.fa - \
| samtools sort -o - -O bam -T aln.tmp.bam \
| bam clipOverlap --in -.bam --out -.bam > aln.bam &&
samtools index aln.bam

The main trick is to convert paired-end fastq files to interleaved format (paste and tr commands), then use cutadapt to read and write interleaved reads and bwa mem to align reads from stdin.

I deliberately left out any optional parameter to each program (I usually set -M in bwa mem). Note that if the two reads in a pair are completely overlapping, clipOverlap will set the cigar string of one of the two reads to fully soft clipped (e.g. 75S). Picard will complain in such cases unless you set the VALIDATION_STRINGENCY to LENIENT or SILENT.

This was with:

cutadapt Version 1.9

bwa Version: 0.7.12-r1039

samtools Version: 1.1 (using htslib 1.1)

bam clipOverlap Version: 1.0.12

align trim fastq pipe paired-end • 6.3k views
ADD COMMENT
1
Entering edit mode

check cutadapt doesn't remove the whole sequence from your reads.

ADD REPLY
1
Entering edit mode

Hi Pierre, what do you mean? If a read partially matches the adapter at 3'end, only the overlapping part is trimmed.

ADD REPLY
1
Entering edit mode

I see. I remember when I used cutadapt, I got some empty reads at the end (and bwa doesn't like this).

ADD REPLY
1
Entering edit mode

Yes, cutadapt can give empty reads if the whole read is trimmed, i.e. it is only adapter sequence. In this cases bwa mem seems to do the right thing by reporting the empty reads as unmapped and with no sequence (older versions might crash). Anyway, cutadapt has -m option to discard reads shorter than INT bases (default is 0) so empty reads never occur. I usually set -m to around 15 or above.

ADD REPLY
1
Entering edit mode
8.4 years ago
chen ★ 2.5k

Try AfterQC (https://github.com/OpenGene/AfterQC), this tool can do trimming and pair-end overlap analysis in fastq level.

ADD COMMENT
0
Entering edit mode

Thanks, but I usually don't (actually never) do paired-end overlap at fastq level. If I have to, I clip paired-end overlaps after alignment using bam clipOverlap, which nicely works with streams (in fact I'm going to edit my post to include it). Does AfterQC work on streams? The nice thing about the cutadapt + bwa mem pipeline above is that you don't need to dump massive (and time consuming) trimmed fastq files on disk.

ADD REPLY

Login before adding your answer.

Traffic: 1177 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6