Question

markDuplicates: Histogram not produced

1

Entering edit mode

10.1 years ago

dariober 15k

Hi All- Picard/markDuplicates sometimes produces the histogram of sequencing vs gain and sometimes doesn't. Does anybody know how/why markDuplicates decides this?

(I haven't checked this carefully and I haven't looked at the code... apologies if I missed it in the docs.)

Edit: Java command:

java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=input.bam O=/dev/null M=md.txt AS=true VALIDATION_STRINGENCY=SILENT

java -jar ~/bin/MarkDuplicates.jar --version
1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)

Edit2: Examples of a metrics file with histogram (I stripped some rows) and one without

cat grm030_pb_BS.mm9_pb11.md.txt
## htsjdk.samtools.metrics.StringHeader
# picard.sam.MarkDuplicates INPUT=[/lustre/sblab/berald01/repository/bwameth/grm030_pb_BS.mm9_pb11.bam] OUTPUT=/dev/null METRICS_FILE=grm030_pb_BS.mm9_pb11.md.txt ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Aug 01 11:23:16 BST 2014

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY    UNPAIRED_READS_EXAMINED    READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
    252563    3095840    1110601    38000    182731    156031    0.062608    160862710

## HISTOGRAM    java.lang.Double
BIN    VALUE
1.0    1.052566
2.0    2.085069
3.0    3.097891
4.0    4.091408
...
98.0    46.844894
99.0    47.00454
100.0    47.161142

This file has no histogram:

cat grm035_REBUiLT_AD06.clean.pb11.md.txt
## htsjdk.samtools.metrics.StringHeader
# picard.sam.MarkDuplicates INPUT=[/lustre/sblab/berald01/repository/bwameth/grm035_REBUiLT_AD06.clean.pb11.bam] OUTPUT=/dev/null METRICS_FILE=grm035_REBUiLT_AD06.clean.pb11.md.txt ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Aug 01 11:23:15 BST 2014

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY    UNPAIRED_READS_EXAMINED    READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
    46913    7889832    0    18667    74972    49840    0.010654    1220238098
Unknown Library    135291    21433156    0    96225    323809    143791    0.017298    1251759451

Thanks!

picard markDuplicates • 4.8k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

0

Entering edit mode

what's your java command line please ?

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks for replying Pierre. Java command included, I thought there is nothing special there...

ADD REPLY • link 10.1 years ago by dariober 15k

0

Entering edit mode

and md.txt is empty or not produced ?

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I put a couple of example files, they probably look messy. I kind of suspect that if there are multiple read groups in the input bam file the is histogram is not produced, could it be the case?

ADD REPLY • link 10.1 years ago by dariober 15k

Ram · Answer 1 · 2014-10-09

In your most recent comment, you said:

I kind of suspect that if there are multiple read groups in the input bam file the is histogram is not produced, could it be the case?

I think you're on the right track. In the code for markduplicates/util/AbstractMarkDuplicatesCommandLineProgram.java:

172  if(metricsByLibrary.size() == 1) {
173       file.setHistogram(metricsByLibrary.values().iterator().next().calculateRoiHistogram());
174  }

Ram · Answer 2 · 2014-10-09

Thanks Deedee for digging the code!

I tested on one of my files and it seems to be the case that markDuplicates doesn't produce the histogram with multiple read groups. Here's the code I've used if anybody cares:

Bam file with read groups:

bam=mybam.bam
java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=$bam O=/dev/null M=test_RG.md.txt AS=true VALIDATION_STRINGENCY=SILENT

grep 'HISTOGRAM' test_RG.md.txt # Nothing grepped

Now strip the read group info from the bam file and re-run markDuplicates:

samtools view -@4 -h $bam | grep -v -P '^@RG\t' | sed -r 's/RG:.+?\t//' | samtools view -@4 -Sb - > test.$bam 

java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=test.$bam O=/dev/null M=test.md.txt AS=true VALIDATION_STRINGENCY=SILENT

grep -A10 'HISTOGRAM' test.md.txt
## HISTOGRAM    java.lang.Double
BIN    VALUE
1.0    1.003927
2.0    1.97376
3.0    2.910658
4.0    3.815738
5.0    4.690082
6.0    5.534732
7.0    6.350698
8.0    7.138954
9.0    7.90044