Question

How can I identify which read is optical duplicate in SAM file?

0

Entering edit mode

9.1 years ago

Tori ▴ 90

I used Picard's MarkDuplicates and got following output in the terminal.

[Tue Nov 17 15:16:43 EET 2015] net.sf.picard.sam.MarkDuplicates INPUT=[../Tophat2-downsampled/28391/accepted_hits.bam] OUTPUT=28391.deduped.sam METRICS_FILE=28391.txt REMOVE_DUPLICATES=false OPTICAL_DUPLICATE_PIXEL_DISTANCE=75    ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Tue Nov 17 15:16:43 EET 2015] Executing as bishwa@portal.rack2 on Linux 3.10.0-229.14.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_65-b17; Picard version: 1.77(1266)
INFO    2015-11-17 15:16:43    MarkDuplicates    Start of doWork freeMemory: 502696952; totalMemory: 506462208; maxMemory: 5726797824
INFO    2015-11-17 15:16:43    MarkDuplicates    Reading input file and constructing read end information.
INFO    2015-11-17 15:16:43    MarkDuplicates    Will retain up to 22725388 data points before spilling to disk.
INFO    2015-11-17 15:16:45    MarkDuplicates    Read 191568 records. 0 pairs never matched.
INFO    2015-11-17 15:16:46    MarkDuplicates    After buildSortedReadEndLists freeMemory: 436332280; totalMemory: 638582784; maxMemory: 5726797824
INFO    2015-11-17 15:16:46    MarkDuplicates    Will retain up to 178962432 duplicate indices before spilling to disk.
INFO    2015-11-17 15:16:47    MarkDuplicates    Traversing read pair information and detecting duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Traversing fragment information and detecting duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Sorting list of duplicate records.
INFO    2015-11-17 15:16:47    MarkDuplicates    After generateDuplicateIndexes freeMemory: 632341752; totalMemory: 2070413312; maxMemory: 5726797824
INFO    2015-11-17 15:16:47    MarkDuplicates    Marking 2681 records as duplicates.
INFO    2015-11-17 15:16:47    MarkDuplicates    Found 129 optical duplicate clusters.
INFO    2015-11-17 15:16:48    MarkDuplicates    Before output close freeMemory: 2301791296; totalMemory: 2313682944; maxMemory: 5726797824
INFO    2015-11-17 15:16:49    MarkDuplicates    After output close freeMemory: 2314374208; totalMemory: 2326265856; maxMemory: 5726797824
[Tue Nov 17 15:16:49 EET 2015] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=2326265856

The terminal output says there are 129 optical duplicates clusters. How can I know the optical duplicate cluster's tile number, x coordinate and y coordinate?

sam picard • 4.3k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.1 years ago by Tori ▴ 90

Ram · Accepted Answer · 2015-11-17

2

Entering edit mode

9.1 years ago

Pierre Lindenbaum 164k

How can I know the optical duplicate cluster's tile number, x coordinate and y coordinate?

See the illumina read name nomenclature: https://en.wikipedia.org/wiki/FASTQ_format

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

where

EAS139      the unique instrument name
136         the run id
FC706VJ     the flowcell id
2           flowcell lane
2104        tile number within the flowcell lane
15343       'x'-coordinate of the cluster within the tile
197393      'y'-coordinate of the cluster within the tile
1           the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y           Y if the read is filtered, N otherwise
18          0 when none of the control bits are on, otherwise it is an even number
ATCACG      index sequence

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Those field you mentioned are of fastq files. Picard's MarkDuplicates take BAM/SAM as input and outputs BAM/SAM.

ADD REPLY • link 9.1 years ago by Tori ▴ 90

0

Entering edit mode

but they are part of the ILLUMINA sam reads. Look at the name of your reads.

ADD REPLY • link 9.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

One of the reads from SAM file looks likes this

J00117:26:H3YC3BBXX:5:1224:28384:20918  419     chr1    12734   3       101M    =       12855   222     TAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCATCTCCTGTCTCCTGGAGAGGCTTCGATGCCCCTCCACACCCTCTTGATCTT   AAFFFKKFKKKAFKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKAFKK,AKKKKFFKKKKKKKFKKKKKKKKF7AKKAFFFKFK<AKFKK,FAAFFFKFKK,   AS:i:-6 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:49G51      YT:Z:UU NH:i:2  CC:Z:chr15      CP:i:102518336  HI:i:0

How do I know if this read is optical duplicate?

ADD REPLY • link 9.1 years ago by Tori ▴ 90

0

Entering edit mode

It's not event a 'any-kind-of' duplicate because 419 doesn't contain the DUP flag (1024): https://broadinstitute.github.io/picard/explain-flags.html

in other case, view the reads before and after this read

e.g.

samtools view your.bam | grep -F "READNAME" -A 10 -B 10

check, there a DUP read at the same chrom+pos , use the names of the reads to check if the cartesian distance is small.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

@Pierre Lindenbaum Thanks for your informative comment. So, PCR duplicates and optical duplicates can not be identified from BAM/SAM file because both have same flag 1024. If I plot x and y coordinates I can possibly see in the plot the location of the duplicates. If there are points very close to one another, then they are optical duplicates. Please correct me if I am wrong. By the way, what does it mean if the flag 1024 is negative. I also found negative 1024 flag in my SAM file.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Tori ▴ 90

0

Entering edit mode

If I plot x and y coordinates I can possibly see in the plot the location of the duplicates.

Yes,... possibly

it mean if the flag 1024 is negative

It can't be negative because it's an array of bits. If there is a negative flag, your bam is corrupted.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.1 years ago by Pierre Lindenbaum 164k