When I use picard MarkDuplicates
to deduplicate a merged bam file from a sci-ATAC-seq experiment, duplicates remain in the file.
The merged bam file comes from a processing pipeline that takes paired-end fastq files and associates individual reads with nuclei of origin via associated molecular barcodes, changing sequencing platform-specific information in fastq files to barcode information—and thus, after the fastq files are aligned, in bam qname fields. The pipeline aligns the fastq files with bowtie2
and merges resulting bam files with sambamba merge
. I sort the merged bam file by coordinate with picard SamSort
, and then I call picard MarkDuplicates
:
picard MarkDuplicates \
--INPUT "${file}.bam" \
--OUTPUT "${file_rmdup}.bam" \
--METRICS_FILE "${file_rmdup}.txt" \
--REMOVE_DUPLICATES true
While a small number of duplicates appear to be removed, many more remain in the bam file:
samtools view "${file_rmdup}.bam" | head -20
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 99 chrX 3041 42 50M = 3205 214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 99 chrX 3041 42 50M = 3205 214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-3 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 147 chrX 3205 42 50M = 3041 -214 CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 147 chrX 3205 42 50M = 3041 -214 CTGGGTGGGACTCAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA EEEEE<EEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:11A38 PG:Z:MarkDuplicates XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-3 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 99 chrX 22016 39 50M = 22179 213 ACATTCTAGCGTGACCCCAGTCAATTCTGAGACTGAAGACAAACTTAATA AAAAAEEEEEEEE/EEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEAEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-25 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 99 chrX 22016 39 50M = 22179 213 ACATTCTAGCGTGACCCCAGTCAATTCTGAGACTGAAGACAAACTTAATA AAAAAEEEAEAEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEAEA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-30 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 147 chrX 22179 39 50M = 22016 -213 TAAAATGGCTTGTCATAATGACTGAAATACTTGAGCCTACTTCCTTGTCC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-15 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 147 chrX 22179 39 50M = 22016 -213 TAAAATGGCTTGTCATAATGACTGAAATACTTGAGCCTACTTCCTTGTCC EEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-15 YS:i:0 YT:Z:CP
# GAATTCGTACGGCGTTAACTCTAACTCGTAATCTTA:0 99 chrX 23671 42 50M = 23786 165 CATTGATAATCCCAGCCTAAACGACCAGCAGTCACTCTGGAAGCCACAGA A/A6A6A66AEAEAA//A6/A6//AE6/A6/AAA6666AA/<//A/66A/ MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTACGGCGTTAACTCTAACTCGTAATCTTA:0 147 chrX 23786 42 50M = 23671 -165 TCAATCCCTCATTATCATCCTCAGACCTTTTGTTATAGTCTCTCCCACTA 6EEEEEEEAEE/EEEEEEEEEE/EEEEEEEEEE6E6EEEE6E/EEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 163 chrX 24055 40 50M = 24088 83 GAACCACAGAGAAAATAGAAACCAAGGAACAAAACACCCACCTCACAACG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<AEEEEEEEEEEE MD:Z:48G1 PG:Z:MarkDuplicates XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-10 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 99 chrX 24083 42 50M = 24178 145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT AAAAAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE MD:Z:20G15T13 PG:Z:MarkDuplicates XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-10 YS:i:0 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 99 chrX 24083 42 50M = 24178 145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT AAAAAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEE/EEE MD:Z:20G15T13 PG:Z:MarkDuplicates XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-10 YS:i:0 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 99 chrX 24083 42 50M = 24178 145 ACAAAACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAAT AAAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:20G15T13 PG:Z:MarkDuplicates XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-10 YS:i:0 YT:Z:CP
# GAATTCGTGCGCTATGGTTTAATTAGCCGTACTGAC:0 83 chrX 24088 40 50M = 24055 -83 ACACCCACCTCACAACGAGAAGACCAGGTATCAGGATGTAGGAATACAAT EEEEEEEEEEEEEEEEEEEEEEEA/EEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:15G15T18 PG:Z:MarkDuplicates XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 AS:i:-10 YS:i:-5 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 147 chrX 24178 42 50M = 24083 -145 AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC EE/EEEEEEEE/EEEEEEEEE6EEEEEEEAEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-10 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 147 chrX 24178 42 50M = 24083 -145 AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC EEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-10 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 147 chrX 24178 42 50M = 24083 -145 AATAATAGCCTGAACAATACATCTCCATTAAAGCCCAGCAACCCTACCAC EEAEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-10 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 99 chrX 24589 42 50M = 24948 409 AAGTTTGGGTTGAAAGTTTTGTGGGTGAGTTTGAAGTGGAAACTTCTAGT AAAAAEEEEEEAEEEEEEEEAEEEEEEEEEEE6E//EEEE/<EEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTTCTATCGCTGAGGCGAGAGCTAATCTTA:0 147 chrX 24948 42 50M = 24589 -409 AAAAACTCACCTAAGAACTGCTTGTGAGGTTCCTGCAGCAACTTGTAGTT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 XS:i:-30 YS:i:0 YT:Z:CP
# For example, what I observe:
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 99 chrX 3041 42 50M = 3205 214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 99 chrX 3041 42 50M = 3205 214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:-3 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 147 chrX 3205 42 50M = 3041 -214 CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 147 chrX 3205 42 50M = 3041 -214 CTGGGTGGGACTCAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA EEEEE<EEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA MD:Z:11A38 PG:Z:MarkDuplicates XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-3 YS:i:0 YT:Z:CP
# Instead, I expect this:
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 99 chrX 3041 42 50M = 3205 214 AGTTAATTCATATGGATAACAGCAATAATACTGAGCCCCTGAGACTACTT AAAAAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
# GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0 147 chrX 3205 42 50M = 3041 -214 CTGGGTGGGACACAGTCTTCTCTTGTTTTCTTTTGTAACAAGATGTAGAA EEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EEEEEEAAAAA MD:Z:50 PG:Z:MarkDuplicates XG:i:0 NM:i:0 XM:i:0 XN:i:0 XO:i:0 AS:i:0 YS:i:0 YT:Z:CP
In checking the log that results from running picard Markduplicates
, I see the following:
WARNING 2022-02-27 21:13:22 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC:0. Cause: String 'GAATTCGTGCGCGTTCATCAGAGAGGTCGTACTGAC' did not start with a parsable number.
Again, fastq sequence identifiers have been converted from sequencing platform-specific information to nucleus barcode information, and thus sequencing platform information is no longer found in bam qname fields, which I think is needed by picard MarkDuplicates
.
Is it possible to use picard MarkDuplicates
for deduplication of merged bam files resulting from this pipeline?
the tool has multiple settings to remove duplicates, and the explanations are a bit complex,
in your case I would recommend filtering by flag, note how some of your flags don't really make sense
you have it labeled as both read1 and read 2, and also secondary and supplementary. Seems pretty odd frankly.
Or get this:
unmapped, duplicated and supplementary? ...
long story short, I think you'll get better results and more control if you filter by flag rather than have the tool remove alignments
Thank you for the response. Sorry, I should have included a header for column names: 3041 and 3205 are in
pos
fields; the flags are 99 and 147.Ah yes, my bad I misread the file since every line started with a
#
it seemed like a more custom format.I will say that replacing read names is a bad practice as it destroys very important information in the BAM file (information on the pairing) that later is impossible to reliably restore. I suspect Picard is misbehaving for that reason.
If you need to add the barcode then add it after a unique identifier
@1-ATGC
. (Just adding that won't solve the optical duplicate detection because that needs more information). The proper way by the way would be to tag each alignment with the read group that marks it with the appropriate information.I would give the samtools rmdup a try to remove duplicates:
Thank you again.
samtools rmdup
works well in this situation, andrepair
from the Subread package is a nice and fast tool to get paired read-ends together.