Hi,
I'm running Picard's EstimateLibraryComplexity on 12 BAM files that are pretty shallow (~400,000 reads per file), with no arguments other than I and O, and I'm getting no output other than the standard output messages.
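In other words, the command is along these lines (actual file names and jar path differ):
java -jar picard.jar EstimateLibraryComplexity I=sample.bam O=sample.est_lib_complexity.txt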
Note that I do find duplicates in these data. For example, this is the standard output of Picard's MarkDuplicates for one sample:
INFO 2016-09-26 18:20:00 MarkDuplicates Start of doWork freeMemory: 2046635632; totalMemory: 2058354688; maxMemory: 28478275584
INFO 2016-09-26 18:20:00 MarkDuplicates Reading input file and constructing read end information.
INFO 2016-09-26 18:20:00 MarkDuplicates Will retain up to 113009030 data points before spilling to disk.
WARNING 2016-09-26 18:20:02 AbstractDuplicateFindingAlgorithm Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name '534371-1'. You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates. Note that this message will not be emitted again even if other read names do not match the regex.
INFO 2016-09-26 18:20:16 MarkDuplicates Read 129293 records. 0 pairs never matched.
INFO 2016-09-26 18:20:19 MarkDuplicates After buildSortedReadEndLists freeMemory: 1967321816; totalMemory: 2884632576; maxMemory: 28478275584
INFO 2016-09-26 18:20:19 MarkDuplicates Will retain up to 889946112 duplicate indices before spilling to disk.
INFO 2016-09-26 18:23:23 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2016-09-26 18:23:23 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2016-09-26 18:23:23 MarkDuplicates Sorting list of duplicate records.
INFO 2016-09-26 18:23:26 MarkDuplicates After generateDuplicateIndexes freeMemory: 3237064784; totalMemory: 10367795200; maxMemory: 28478275584
INFO 2016-09-26 18:23:26 MarkDuplicates Marking 91622 records as duplicates.
INFO 2016-09-26 18:23:26 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2016-09-26 18:23:36 MarkDuplicates Before output close freeMemory: 10352402680; totalMemory: 10367795200; maxMemory: 28478275584
INFO 2016-09-26 18:23:37 MarkDuplicates After output close freeMemory: 10352475912; totalMemory: 10367795200; maxMemory: 28478275584
But then this is the standard output of EstimateLibraryComplexity for the same sample:
INFO 2016-09-26 18:23:38 EstimateLibraryComplexity Will store 46230966 read pairs in memory before sorting.
INFO 2016-09-26 18:23:46 EstimateLibraryComplexity Finished reading - moving on to scanning for duplicates.
[Mon Sep 26 18:23:46 EDT 2016] picard.sam.EstimateLibraryComplexity done. Elapsed time: 0.12 minutes.
Has anyone ever experienced this?
Could it be a PCR-free library prep?
It's a selection for short RNAs (miRs), but MarkDuplicates reports 91622 records as duplicates.
Have the data been trimmed to a length typical for miRs (~30 bp)? EstimateLibraryComplexity matches the first 50 bp of each read to identify duplicates, so it may not work if the reads are shorter than that (although I don't know for sure).
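If it helps, a quick way to check the read-length distribution in one of the BAMs is something like this (assuming samtools is available; sample.bam is a placeholder):
samtools view sample.bam | awk '{print length($10)}' | sort -n | uniq -c
This prints the length of the SEQ field (column 10) for each record and counts how many reads have each length.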
But it's unclear why you need this metric, since MarkDuplicates already indicates that you're near saturation: about 70% (91622/129293) of the reads are duplicates.
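For what it's worth, the MarkDuplicates metrics file (the M= output) already reports PERCENT_DUPLICATION and ESTIMATED_LIBRARY_SIZE, e.g. (file names here are placeholders):
java -jar picard.jar MarkDuplicates I=sample.bam O=sample.dedup.bam M=sample.dup_metrics.txt
So that file may already give you what you were hoping to get from EstimateLibraryComplexity.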