I have never worked with paired-end sequencing before. I am using bwa, samtools and picard. I was wondering if anybody could point out any obvious mistakes in my workflow. I put it together after scanning through posts on Biostar.
This is my workflow before alignment:
bwa aln -t 4 genomefile data.R1.fastq > data.R1.bwa
bwa aln -t 4 genomefile data.R2.fastq > data.R2.bwa
bwa sampe genomefile data.R1.bwa data.R2.bwa data.R1.fastq data.R2.fastq > data.sam
This is my work flow after alignment:
samtools view -b -S -h -q1 -F0x04 data.sam > data.bam
java -Xmx4g -jar SortedSam.jar INPUT=data.bam OUTPUT=data_sorted.bam SORT_ORDER=coordinate
java -Xmx4g -jar MarkDuplicates.jar INPUT=data_sorted.bam OUTPUT=data_unique.bam METRICS_FILE=metrics.txt REMOVE_DUPLICATES=true TMP_DIR=/tmp
Most of my sequences align, but I lost a tremendous number of reads in the MarkDuplicates
step. I have four sequencing samples.
In one case, I lose 85% of my reads after removing duplicates. In another, I only lose 2%. This seems like a big range. Is there something wrong with my workflow? I noticed picard had some output discussing optical duplicates. I have no idea how Picard analyzes optical duplicates so I just tried to use default settings. Could that be the source of my troubles?
EDIT: My data is Chip-seq for Pol II in C.elegans. The 85% loss of reads is coming from the ChIP sample and 2% loss of reads occurred in one of the input samples.
EDIT #2: It turns out that I should have concentrated on properly paired reads, not just mapped reads, so I thought I should change the samtools line to:
samtools view -b -S -h -q1 -f0x02 data.sam > data.bam
What are you aligning exactly? The source of the reads has a huge impact on whether it is desirable or necessary to mark duplicates.
@Daniel Swan: I editted the question to add the information you requested. Just to answer your question, the reads are from input and ChIP samples for Pol II in C. elegans.