I use Picard to mark duplicates and found that Picard does not support multiple threads and is very slow. To speed it up, I want to split the BAM file by chromosome and then run Picard on each file. The problem is that, as far as I know, Picard has an advantage over samtools rmdup because it can mark cross-chromosome duplicates. So if I split the BAM file by chromosome, how much will that affect the result? Here is my reasoning:
A pair of reads must come from the same DNA fragment, so the two reads normally map to the same chromosome. Sometimes, however, the two reads map to different chromosomes, perhaps because of a structural variation or a repeat such as a microsatellite. If I only want to call SNVs and indels, can I ignore the cross-chromosome duplicates? Please let me know if there is anything wrong with my reasoning. Any reply will be much appreciated!
Alternatively, if the effect is significant, how can I compensate for it?
You can alternatively use Clumpify, which does duplicate-marking or duplicate-removal prior to mapping and is extremely fast.
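As for how much splitting by chromosome would matter, one rough way to gauge it is to count how many mapped, paired reads have a mate on a different chromosome, since those are the pairs that end up in separate files after a split. Below is a minimal sketch that works on SAM text lines and relies only on the standard SAM columns (FLAG, RNAME, RNEXT). The function name is my own and purely illustrative, and the code only counts reads; it does not mark duplicates.

```python
def count_cross_chromosome_reads(sam_lines):
    """Return (paired_mapped_reads, reads_with_mate_on_other_chromosome).

    sam_lines is any iterable of SAM-format text lines (header lines allowed).
    """
    paired = 0
    cross = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        # Keep only paired reads (0x1) where both the read (0x4) and its
        # mate (0x8) are mapped; skip secondary (0x100) and
        # supplementary (0x800) alignments.
        if not flag & 0x1 or flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        paired += 1
        # In SAM, RNEXT is "=" when the mate is on the same reference
        # and "*" when the information is unavailable.
        if rnext not in ("=", "*") and rnext != rname:
            cross += 1
    return paired, cross
```

To use it, pass `sys.stdin` to the function in a small script and pipe in the output of `samtools view` on your BAM file. Each cross-chromosome pair contributes two reads to the second number. If that number is a tiny fraction of the first, splitting is unlikely to change duplicate marking much for SNV and indel calling, though that is a judgment call for your data.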