Hi, I'm building a WES pipeline to identify variants in human sequences. So far I'm using, in order:
Read QC and trimming: fastq
Alignment: bwa index, bwa mem, samtools view, samtools sort, and samtools index.
Remove PCR duplicates: picard markduplicates?
When it comes to removing PCR duplicates, I have seen that Picard's MarkDuplicates identifies duplicates:
java -jar MarkDuplicates.jar I=PE_samtoolssorted.bam O=markedduplicates.bam M=markedduplicatesmetrics.txt
However, when it comes to actually removing the PCR duplicates that are found, I have read online that just adding REMOVE_DUPLICATES=true removes them. Is that right?
java -jar MarkDuplicates.jar I=PE_samtoolssorted.bam O=removedduplicates.bam M=markedduplicatesmetrics.txt REMOVE_DUPLICATES=true
Will the output of this be a sorted BAM file with the PCR duplicates removed?
Would the input for a variant caller like DeepVariant, which requires a sorted BAM file, be this removedduplicates.bam file?
And if so, would it be this removedduplicatessorted.bam file that needs indexing for input into DeepVariant, rather than the original PE_samtoolssorted.bam?
Thanks! Sorry if confusing. Amy
According to the documentation, yes: REMOVE_DUPLICATES=true should write an alignment file from which the duplicate reads have been removed, rather than only flagged. MarkDuplicates keeps the sort order of its input, so if PE_samtoolssorted.bam was coordinate-sorted the output should be too. You would only need to re-sort if that is not the case, but you will need to index the new file. That deduplicated, sorted and indexed file (not the original PE_samtoolssorted.bam) would then be used for further downstream analyses, assuming it is the appropriate input for the steps you want to perform.
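To make the order of operations concrete, here is a minimal sketch of the dedup-then-index step using the file names from the question. The `run` and `pipeline` helpers and the `DRY_RUN` variable are made up for illustration; by default the script only prints the commands, and it assumes an older Picard layout with a separate MarkDuplicates.jar (newer releases are typically invoked as `java -jar picard.jar MarkDuplicates ...`).

```shell
#!/bin/sh
# Sketch only: file names are from the question; run/pipeline/DRY_RUN are hypothetical helpers.
set -eu

IN_BAM="PE_samtoolssorted.bam"        # coordinate-sorted input from samtools sort
OUT_BAM="removedduplicates.bam"       # deduplicated output
METRICS="markedduplicatesmetrics.txt" # duplication metrics report

# Print a command; execute it only when DRY_RUN=0 (the default is a dry run).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

pipeline() {
    # Drop duplicate reads from the output instead of only flagging them.
    run java -jar MarkDuplicates.jar I="$IN_BAM" O="$OUT_BAM" M="$METRICS" REMOVE_DUPLICATES=true
    # Index the deduplicated BAM; this writes removedduplicates.bam.bai next to it.
    run samtools index "$OUT_BAM"
}

pipeline
```

Setting `DRY_RUN=0` would actually execute the two commands. If you are unsure whether the output is still coordinate-sorted, `samtools view -H removedduplicates.bam` prints the header, and the `@HD` line should show `SO:coordinate`. The file to hand to DeepVariant would then be removedduplicates.bam together with its index.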