Question

Please Help Me Check My Workflow For Paired-End Sequencing

2

Entering edit mode

12.2 years ago

KCC ★ 4.1k

I have never worked with paired-end sequencing before. I am using bwa, samtools and picard. I was wondering if anybody could point out any obvious mistakes in my workflow. I put it together after scanning through posts on Biostar.

This is my workflow before alignment:

bwa aln -t 4 genomefile data.R1.fastq >  data.R1.bwa
bwa aln -t 4 genomefile data.R2.fastq >  data.R2.bwa
bwa sampe genomefile data.R1.bwa data.R2.bwa data.R1.fastq data.R2.fastq > data.sam

This is my work flow after alignment:

samtools view -b -S -h -q1 -F0x04 data.sam > data.bam

java -Xmx4g -jar SortedSam.jar INPUT=data.bam OUTPUT=data_sorted.bam SORT_ORDER=coordinate

java -Xmx4g -jar MarkDuplicates.jar INPUT=data_sorted.bam OUTPUT=data_unique.bam METRICS_FILE=metrics.txt REMOVE_DUPLICATES=true TMP_DIR=/tmp

Most of my sequences align, but I lost a tremendous number of reads in the MarkDuplicates step. I have four sequencing samples.

In one case, I lose 85% of my reads after removing duplicates. In another, I only lose 2%. This seems like a big range. Is there something wrong with my workflow? I noticed picard had some output discussing optical duplicates. I have no idea how Picard analyzes optical duplicates so I just tried to use default settings. Could that be the source of my troubles?

EDIT: My data is Chip-seq for Pol II in C.elegans. The 85% loss of reads is coming from the ChIP sample and 2% loss of reads occurred in one of the input samples.

EDIT #2: It turns out that I should have concentrated on properly paired reads, not just mapped reads, so I thought I should change the samtools line to:

samtools view -b -S -h -q1 -f0x02 data.sam > data.bam

bwa picard samtools workflow • 5.4k views

ADD COMMENT • link updated 12.2 years ago by Pablo Marin-Garcia ★ 2.0k • written 12.2 years ago by KCC ★ 4.1k

0

Entering edit mode

What are you aligning exactly? The source of the reads has a huge impact on whether it is desirable or necessary to mark duplicates.

ADD REPLY • link 12.2 years ago by User 59 13k

0

Entering edit mode

@Daniel Swan: I editted the question to add the information you requested. Just to answer your question, the reads are from input and ChIP samples for Pol II in C. elegans.

ADD REPLY • link 12.2 years ago by KCC ★ 4.1k

score 3 · Answer 1 · 2013-04-27

This could be library prep issue rather than a bioinformatics workflow issue. If the ChIP failed for some reason, you'd get a low-complexity library which in turn would have lots of PCR artifacts that would be removed by MarkDuplicates.

Sometimes when you have low library complexity you'll have a lot of adapter dimers, too -- you can check the overrepresented sequences section of FastQC output to see if adapters are overrepresented in the ChIP libraries.

Ram · Answer 2 · 2013-04-27

One small suggestion;

bwa sampe genomefile data.R1.bwa data.R2.bwa data.R1.fastq data.R2.fastq | samtools view -bShq 1 -F 4  - > data.bam

Or

bwa aln -t 4 genomefile data.R1.fastq >  data.R1.bwa
bwa aln -t 4 genomefile data.R2.fastq | bwa sampe genomefile data.R1.bwa - data.R1.fastq data.R2.fastq | samtools view -bShq 1 -F 4  - > data.bam

It just makes fewer intermediate files. .sams in particular are big, so you might as well go straight through to making the .bam file.

You might be able to pipe the results of samtools view into your sorting function, too.

Nothing look wrong here, I'd say believe your data. Maybe your libraries are poor, and that's why your stats are bad. Not your fault.

The one possible weak point is the duplicate removal; if your experiment is such that you expect vast coverage, you might be wrongfully getting rid of natural duplicates that are not artifacts. But I'm not sure that's likely with your experiment. Examining the reads flagged as duplicates might help you see if that's the case. But in general, removing duplicates is safe when using paired end data.

Ram · Answer 3 · 2013-05-01

Seems to me that you have a problem with adaptors self ligation. One possible cause of it is an inaccurate qPCR calculation for exaaple when there is not a clear 'average size' in your library and therefore is difficult to asses the right molarity in order to add the right amount of adapters. You would be able to see this with fastqc and looking at the bionalyzer plot of your sample before adaptors ligation.

Also by using fastqc you would be able to see the quality at the end of your reads. If your reads are 100pb or more, it is highly probably that you would need to clean part of the end of the read. You can do a hard cut (looking at fastqc quality boxplots and look when cycles go down of Q30) or an adaptive one.

I would recommend you to improve the web lab part (ampure/pipin for removing self adaptor ligations and look at the bioanalyzer or qiaxcel images of the library distribution) and at the bioinformtatics QC steps: remove adaptors and trimming low quality bases at the end of the reads.And if your reads are longer than 70pb to try the new bwa mem algorithm.

For an adaptive trimming you can use option -q 15 in bwa aln

bwa aln -q15 -t 4 ......

or you can use seqtk:

# Trim low-quality bases from both ends using the Phred algorithm:
seqtk trimfq in.fq > out.fq

or cutadapt (that also removes the adapters) but from the help it says that -q option only removes at the end (by contrast seqtk removes from both ends)

Note: if your library insert size distribution is too broad and you have not eliminated (ampure, pipin) the small fragments then you would like to remove the adapters (cutadapt is a good tool) unless they were already removed at demultiplexing time.

./cutadapt -f fastq  -q 20 -a AGATCGGAAGAGC input_file.fastq

I would recommend the use of seqtk or cutadapt because you would be able to use the pipeline also with the newer versions of bwa that have an improved algorithm for aligning reads longer than 70 pb (bwa mem), that seems not to have an option for adaptive trimming

bwa mem -t4 -R $read_group_string -CM ref.fa reads1.fastq reads2.fastq > paired_reads.sam