Hi,
I am trying to figure out if I have over marked duplicates in my merged BAM file. This file has been generated as follows:
I had 5 libraries (from the same sample) that were run on 4 lanes of NovaSeq, so e.g. for Lib_1, I have
Lib1_L001.bam Lib1_L002.bam Lib1_L003.bam Lib1_L004.bam
I aligned each against the reference genome using BWA and got per lane-BAMs. These were marked for duplicates before merging into a single BAM for merged_Lib_1. My concern is that this only marks optical dups and leaves out PCR duplicates, and I am wondering how to check this. I found some answers that help in printing out alignments with the samtools view -f 1024 flag, but I did not find a way to tell if there are PCR duplicates left behind.
So, I ran Picard CollectDuplicateMetrics for the merged_Lib_1 file and it shows 11% duplication rate. I then ran Picard MarkDuplicates again on this file and the MarkDuplicates now shows a 26% duplication rate. Is this over-marking dups or actually reflects those duplicates that were left out?
Eventually, I have to merge all the merged_Lib_1/2/3/4/5 files into a giant final BAM. I am really confused if I should run MarkDups on all the merged Lib_x files separately and then merge into a final BAM? Would be really grateful for any leads!!
My intuition is that duplicates are not 'over-marked' but this reflects reality. Something else to consider is the type of library you produced and where these duplicate sequences exist - was it polyA selected total RNA? Are all the duplicates of rRNA? If so, it's not a big deal...but it really comes down to specifics here.
I recommend to merge fastq files by sample/library before alignment. You can just
cat
the fastq.gz files, or this can be accomplished at the level of producing the fastq file by passing--no-lane-splitting
tobcl2fastq
orbcl_convert
.I don't think any aligner will correctly mark duplicate sequences by default, aligners will almost always mark reads that have 'secondary hits' as duplicates. In principle, duplicates can be marked at the level of the fastq file (before alignment) or at the level of the alignment. IMO duplicates should always be marked at the level of alignment because one cDNA molecule can lead to multiple populations of sequences, but these will most likely only align to the same coordinate. Using
Picard MarkDuplicates
you can distinguish optical duplicates from PCR duplicates, but you need to make sure to pass the appropriate flags (patterned vs unpatterned flow cell). Of course, UMIs make all the difference here and you'll need to pass these if they exist.If you want to check duplicate sequences at the level of fastq, I recommend
prinseq
because this has several algorithms to mark duplicates (5', 3', exact match, etc) but if you mark at level of fastq, you should ensure the files are 'clean' by first runningcutadapt
and removing any other artifact from library prep.Thank you for your reply. Will try these suggestions!