9.1 years ago
Tori
I used Picard's MarkDuplicates and got the following output in the terminal.
[Tue Nov 17 15:16:43 EET 2015] net.sf.picard.sam.MarkDuplicates INPUT=[../Tophat2-downsampled/28391/accepted_hits.bam] OUTPUT=28391.deduped.sam METRICS_FILE=28391.txt REMOVE_DUPLICATES=false OPTICAL_DUPLICATE_PIXEL_DISTANCE=75 ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Tue Nov 17 15:16:43 EET 2015] Executing as bishwa@portal.rack2 on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17; Picard version: 1.77(1266)
INFO 2015-11-17 15:16:43 MarkDuplicates Start of doWork freeMemory: 502696952; totalMemory: 506462208; maxMemory: 5726797824
INFO 2015-11-17 15:16:43 MarkDuplicates Reading input file and constructing read end information.
INFO 2015-11-17 15:16:43 MarkDuplicates Will retain up to 22725388 data points before spilling to disk.
INFO 2015-11-17 15:16:45 MarkDuplicates Read 191568 records. 0 pairs never matched.
INFO 2015-11-17 15:16:46 MarkDuplicates After buildSortedReadEndLists freeMemory: 436332280; totalMemory: 638582784; maxMemory: 5726797824
INFO 2015-11-17 15:16:46 MarkDuplicates Will retain up to 178962432 duplicate indices before spilling to disk.
INFO 2015-11-17 15:16:47 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2015-11-17 15:16:47 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2015-11-17 15:16:47 MarkDuplicates Sorting list of duplicate records.
INFO 2015-11-17 15:16:47 MarkDuplicates After generateDuplicateIndexes freeMemory: 632341752; totalMemory: 2070413312; maxMemory: 5726797824
INFO 2015-11-17 15:16:47 MarkDuplicates Marking 2681 records as duplicates.
INFO 2015-11-17 15:16:47 MarkDuplicates Found 129 optical duplicate clusters.
INFO 2015-11-17 15:16:48 MarkDuplicates Before output close freeMemory: 2301791296; totalMemory: 2313682944; maxMemory: 5726797824
INFO 2015-11-17 15:16:49 MarkDuplicates After output close freeMemory: 2314374208; totalMemory: 2326265856; maxMemory: 5726797824
[Tue Nov 17 15:16:49 EET 2015] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=2326265856
The terminal output says there are 129 optical duplicate clusters. How can I find out the tile number, x coordinate and y coordinate of each optical duplicate cluster?
Those fields you mentioned belong to FASTQ files. Picard's MarkDuplicates takes BAM/SAM as input and outputs BAM/SAM.
But they are part of the Illumina SAM reads. Look at the names of your reads.
One of the reads from the SAM file looks like this:
How do I know if this read is an optical duplicate?
It's not even a duplicate of any kind, because 419 doesn't contain the DUP flag (1024): https://broadinstitute.github.io/picard/explain-flags.html
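To make the flag check concrete, here is a minimal Python sketch that decodes a SAM flag into its bits. The bit values come from the SAM specification; the function names `decode_flag` and `is_duplicate` are just illustrative and not part of Picard or any library.

```python
# Bit values as defined in the SAM specification (FLAG column).
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",  # 1024, the bit MarkDuplicates sets
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_duplicate(flag):
    """True if the duplicate bit (1024) is set."""
    return bool(flag & 0x400)


print(decode_flag(419))
print(is_duplicate(419))
```

For 419 the duplicate bit is not set, while the same flag with 1024 added (1443) would be reported as a duplicate.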
Otherwise, view the reads before and after this read,
e.g.
Check whether there is a DUP read at the same chrom+pos, then use the names of the reads to check whether the Cartesian distance between them is small.
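The read-name comparison can be sketched in Python as follows. The regular expression is the READ_NAME_REGEX printed in the log above, whose three groups are tile, x and y; the threshold of 75 is the OPTICAL_DUPLICATE_PIXEL_DISTANCE from the same log. The read names in the example are made up, the helper names are hypothetical, and comparing the x and y offsets separately on the same tile is my assumption about how to apply the threshold, so treat this as a rough check and not as Picard's exact logic.

```python
import re

# READ_NAME_REGEX as printed in the MarkDuplicates log; groups are tile, x, y.
READ_NAME_REGEX = re.compile(r"[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*")


def parse_location(read_name):
    """Return (tile, x, y) from a read name, or None if it does not match."""
    m = READ_NAME_REGEX.fullmatch(read_name)
    if m is None:
        return None
    tile, x, y = (int(g) for g in m.groups())
    return tile, x, y


def looks_optical(name_a, name_b, max_dist=75):
    """Assumed check: same tile and both x and y offsets within max_dist."""
    a = parse_location(name_a)
    b = parse_location(name_b)
    if a is None or b is None:
        return False
    same_tile = a[0] == b[0]
    return same_tile and abs(a[1] - b[1]) <= max_dist and abs(a[2] - b[2]) <= max_dist


# Hypothetical read names, for illustration only.
print(parse_location("HWI:1:1101:1000:2000"))
print(looks_optical("HWI:1:1101:1000:2000", "HWI:1:1101:1050:2010"))
```

Only pairs that are already flagged as duplicates at the same chrom+pos are worth passing to `looks_optical`; physical proximity on the flowcell alone says nothing about duplication.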
@Pierre Lindenbaum Thanks for your informative comment. So PCR duplicates and optical duplicates cannot be told apart from the flags in a BAM/SAM file, because both carry the same flag, 1024. If I plot the x and y coordinates, I might be able to see where the duplicates are located: points that lie very close to one another would be optical duplicates. Please correct me if I am wrong. By the way, what does it mean if the flag 1024 is negative? I also found a negative 1024 flag in my SAM file.
Yes,... possibly
It can't be negative because the flag is an array of bits. If there is a negative flag, your BAM is corrupted.
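A quick way to check this on a SAM text file is to scan column 2 for values outside the range the SAM specification allows for FLAG (0 to 65535). The sketch below is a minimal example; `invalid_flag_lines` is a hypothetical helper and the records in the example are made up. It is also worth ruling out that the negative number sits in another column, such as TLEN (column 9), which is signed and can legitimately be negative.

```python
def invalid_flag_lines(sam_lines):
    """Return 1-based line numbers whose FLAG is missing, non-integer or out of range."""
    bad = []
    for lineno, line in enumerate(sam_lines, start=1):
        # Skip header lines and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        try:
            flag = int(fields[1])  # FLAG is the second column
        except (IndexError, ValueError):
            bad.append(lineno)
            continue
        if not 0 <= flag <= 0xFFFF:
            bad.append(lineno)
    return bad


# Made-up, truncated records for illustration only.
example = [
    "@HD\tVN:1.0\n",
    "r1\t419\tchr1\t100\n",
    "r2\t-1024\tchr1\t100\n",
]
print(invalid_flag_lines(example))
```

With a real file, the lines can be passed directly from an open file handle, e.g. `invalid_flag_lines(open(path))`, where `path` is your SAM file.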