Question

Demultiplexing before merging paired-end reads?

0

Entering edit mode

6.0 years ago

asidhu ▴ 10

I'm trying to create a pipeline that converts FASTQ --> VCF and it would include the option of single-end and paired-end reads. I am a little bit confused about how to approach this for paired-end reads as they are more complicated. For single-end pairs I've done the following steps:

Demultiplex with sabre se
Trim adapters with Cutadapt
Map reads with BWA
Convert .SAM to .BAM using samtools view
Sorted alignments using samtools sort
Created an index using samtools index
Used samtools mpileup to create .BCF
Converted .BCF to .VCF using bcftools call

For single-end pairs this has worked well but I am unsure of the process for paired-end reads. I am new to bioinformatics so I am still in the learning process.

For paired-end reads I was told I need to merge the read pairs together before cutting the adapters on each end. Would the process by as follows:

Demultiplex with sabre pe
Join the forward and reverse pairs using pandaseq
Trim the adapters from each side using cutadapt and since with pandaseq a single merged file is created, treat as single-end?

sabre pandaseq cutadapt • 3.8k views

ADD COMMENT • link updated 5.5 years ago by Biostar 20 • written 6.0 years ago by asidhu ▴ 10

0

Entering edit mode

It seems pandaseq doesn't know anything about adapters, so you should trim adapters before joining reads - the adapters will be at the end of the reads, and will prevent paired read merging. Also, use Trimmomatic or BBDuk, which trim the reads as pairs.

ADD REPLY • link 6.0 years ago by h.mon 35k

0

Entering edit mode

For paired-end reads I was told I need to merge the read pairs together before cutting the adapters on each end.

That will only work if you are sure that all reads are supposed to overlap in the middle (i.e. insert sizes are shorter than sequencing cycles).

ADD REPLY • link 6.0 years ago by GenoMax 147k

score 0 · Answer 1 · 2018-11-14

While you didn't specify the end goal of your pipeline, let me give you some feedback.

Don't reinvent the wheel when excellent pipelines already exist

Map reads with BWA
Convert .SAM to .BAM using samtools view
Sorted alignments using samtools sort

These don't have to be separate steps:

bwa mem ref.fa read_R1.fastq.gz read_R2.fastq.gz | samtools sort -o my_sorted_alignment.bam

Saves you time and storage.

Join the forward and reverse pairs using pandaseq

I think you should just leave them paired-end separate files and align them as above - without merging.

score 0 · Answer 2 · 2018-11-14

This post is between a comment and a response. I work a lot with ancient DNA where fragments are very short (e.g. 40bp). This entails that even for a paired-end run you constantly have to deal with "single-end" reads (merged paired-end) and genuine paired-end data. The logic was getting very convoluted and messy. At the Max Planck Institute in Leipzig, we developed the following methodology (starting from fastq but you can easily start from bcl) :

fastq2bam in.fwd.fastq.gz in.rev.fastq.gz  | leeHom  -o /dev/stdout /dev/stdin   | deML -i index -o /dev/stdout  /dev/stdin  |  bwa bam2bam -g [genome] /dev/stdin | samtools sort /dev/stdin outputprefix

This is one line and you have a bam file without any data redundancy.

fastq2bam is found here: https://github.com/grenaud/BCL2BAM2FASTQ/

leeHom is found here: http://grenaud.github.io/leeHom/

deMLis found here: http://grenaud.github.io/deML/

a hacked bwa that eats bam is found here: https://github.com/mpieva/network-aware-bwa

a hacked samtools that sorts by going over the file once is found here: https://github.com/mpieva/samtools-patched