Demultiplexing before merging paired-end reads?
2
0
Entering edit mode
6.1 years ago
asidhu ▴ 10

I'm trying to create a pipeline that converts FASTQ --> VCF and it would include the option of single-end and paired-end reads. I am a little bit confused about how to approach this for paired-end reads as they are more complicated. For single-end pairs I've done the following steps:

  • Demultiplex with sabre se
  • Trim adapters with Cutadapt
  • Map reads with BWA
  • Convert .SAM to .BAM using samtools view
  • Sorted alignments using samtools sort
  • Created an index using samtools index
  • Used samtools mpileup to create .BCF
  • Converted .BCF to .VCF using bcftools call

For single-end pairs this has worked well but I am unsure of the process for paired-end reads. I am new to bioinformatics so I am still in the learning process.

For paired-end reads I was told I need to merge the read pairs together before cutting the adapters on each end. Would the process by as follows:

  • Demultiplex with sabre pe
  • Join the forward and reverse pairs using pandaseq
  • Trim the adapters from each side using cutadapt and since with pandaseq a single merged file is created, treat as single-end?
sabre pandaseq cutadapt • 3.8k views
ADD COMMENT
0
Entering edit mode

It seems pandaseq doesn't know anything about adapters, so you should trim adapters before joining reads - the adapters will be at the end of the reads, and will prevent paired read merging. Also, use Trimmomatic or BBDuk, which trim the reads as pairs.

ADD REPLY
0
Entering edit mode

For paired-end reads I was told I need to merge the read pairs together before cutting the adapters on each end.

That will only work if you are sure that all reads are supposed to overlap in the middle (i.e. insert sizes are shorter than sequencing cycles).

ADD REPLY
0
Entering edit mode
6.1 years ago

While you didn't specify the end goal of your pipeline, let me give you some feedback.

Don't reinvent the wheel when excellent pipelines already exist

Map reads with BWA
Convert .SAM to .BAM using samtools view
Sorted alignments using samtools sort

These don't have to be separate steps:

bwa mem ref.fa read_R1.fastq.gz read_R2.fastq.gz | samtools sort -o my_sorted_alignment.bam

Saves you time and storage.

Join the forward and reverse pairs using pandaseq

I think you should just leave them paired-end separate files and align them as above - without merging.

ADD COMMENT
0
Entering edit mode
6.1 years ago
Gabriel R. ★ 2.9k

This post is between a comment and a response. I work a lot with ancient DNA where fragments are very short (e.g. 40bp). This entails that even for a paired-end run you constantly have to deal with "single-end" reads (merged paired-end) and genuine paired-end data. The logic was getting very convoluted and messy. At the Max Planck Institute in Leipzig, we developed the following methodology (starting from fastq but you can easily start from bcl) :

fastq2bam in.fwd.fastq.gz in.rev.fastq.gz  | leeHom  -o /dev/stdout /dev/stdin   | deML -i index -o /dev/stdout  /dev/stdin  |  bwa bam2bam -g [genome] /dev/stdin | samtools sort /dev/stdin outputprefix

This is one line and you have a bam file without any data redundancy.

fastq2bam is found here: https://github.com/grenaud/BCL2BAM2FASTQ/

leeHom is found here: http://grenaud.github.io/leeHom/

deMLis found here: http://grenaud.github.io/deML/

a hacked bwa that eats bam is found here: https://github.com/mpieva/network-aware-bwa

a hacked samtools that sorts by going over the file once is found here: https://github.com/mpieva/samtools-patched

ADD COMMENT

Login before adding your answer.

Traffic: 1610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6