Could anyone explain the difference between the options -s and -S for "samtools rmdup"? In addition, is it standard practice to use -sS to remove duplicate reads?
I recently tried to remove the duplicates in one Bam file. After running the command line "samtools rmdup -sS in.nameSrt.bam out.bam", the size of the Bam file decreased from 11G to 5.2G and the log showed that 52.48% of the reads had been removed. I'm really worried about the massive amount of data loss.
By the way, one of my goals in the downstream analysis is to call genotypes and detect SNPs.
The question is about single-end reads (-s). Does MarkDuplicates work with SE data?
Do you know the difference between the two options -s and -S? Could I use -s only to avoid too much data loss?
Are you really using single-end data?
No, the sequencing is done using Illumina HiSeq 2000 which should generate paired-end data.
So you don't have to deal with those options (-s), and you should use MarkDuplicates. http://samtools.sourceforge.net/samtools.shtml "Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicate which correctly handles these cases, although a little slower."
I would like to try MarkDuplicates if I could. However, I need to process thousands of Bam files and my pipeline relies heavily on Samtools.
In principle I don't see any problem in passing a file to MarkDuplicates instead of samtools. Maybe you should give more detail of your pipeline.
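As a rough illustration of swapping one tool for the other in a batch, here is a minimal dry-run sketch. It only prints the commands it would run, one per Bam file. The jar location (PICARD_JAR) and the output file names (.rmdup.bam, .markdup.bam, .metrics.txt) are assumptions made for the example, not something taken from the pipeline discussed above.

```shell
#!/bin/sh
# Dry-run sketch: prints commands instead of executing them.
# PICARD_JAR is an assumed location; point it at the local Picard jar.
PICARD_JAR="${PICARD_JAR:-picard.jar}"

# samtools rmdup in its default paired-end mode (no -s / -S).
# The output name is a hypothetical convention.
rmdup_cmd() {
    printf 'samtools rmdup %s %s\n' "$1" "${1%.bam}.rmdup.bam"
}

# Picard MarkDuplicates on the same input.
# I = input, O = output, M = duplication metrics file (names are hypothetical).
markdup_cmd() {
    printf 'java -jar %s MarkDuplicates I=%s O=%s M=%s\n' \
        "$PICARD_JAR" "$1" "${1%.bam}.markdup.bam" "${1%.bam}.metrics.txt"
}

# Print one MarkDuplicates command per Bam file given on the command line.
for bam in "$@"; do
    markdup_cmd "$bam"
done
```

Saved under a hypothetical name such as dedup_cmds.sh, it could be called as "sh dedup_cmds.sh *.bam" to inspect the commands, and the output piped to "sh" to actually run them (file names containing spaces would need extra quoting). Note that MarkDuplicates by default only flags duplicates in the output rather than dropping them; adding REMOVE_DUPLICATES=true makes it remove them.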