Hi,
I have paired-end 2x100 bp targeted DNA-seq reads that span multiple regions in the genome. Read 2 contains two barcodes:
- bp 1-10 : barcode 1
- bp 11-19 : barcode 2
These barcodes are useful to distinguish between the different samples (barcode 2) and between DNA fragments (barcode 1). What I want is one BAM file per sample, with duplicate reads removed (same barcode 1 and same alignment position). I saw that Picard MarkDuplicates has barcode options:
BARCODE_TAG (String) Barcode SAM tag (ex. BC for 10X Genomics) Default value: null.
READ_ONE_BARCODE_TAG (String) Read one barcode SAM tag (ex. BX for 10X Genomics) Default value: null.
READ_TWO_BARCODE_TAG (String) Read two barcode SAM tag (ex. BX for 10X Genomics) Default value: null.
But I'm a bit lost as to how to tell Picard which positions within read 2 to check. Any ideas?
If Picard is not suited to this task, I thought of parsing R2 to extract barcodes 1 and 2, then removing the duplicates by checking alignment position and barcode information.
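To make the second idea concrete, here is a minimal sketch of what I have in mind, in plain Python. It assumes barcode 1 (bp 1-10) is the fragment barcode and barcode 2 (bp 11-19) is the sample barcode, that barcodes are compared by exact match (no tolerance for sequencing errors), and that the alignment has already been reduced to a simple record. The names `split_read2`, `Aln` and `dedup_by_sample` are hypothetical, made up for illustration, and not part of Picard or any other tool.

```python
from collections import namedtuple

BC1_LEN = 10  # barcode 1 (fragment barcode): bp 1-10 of read 2
BC2_LEN = 9   # barcode 2 (sample barcode): bp 11-19 of read 2


def split_read2(seq, qual):
    """Split a read 2 into (barcode1, barcode2, trimmed_seq, trimmed_qual)."""
    end = BC1_LEN + BC2_LEN
    if len(seq) < end:
        raise ValueError("read 2 is shorter than the two barcodes")
    return seq[:BC1_LEN], seq[BC1_LEN:end], seq[end:], qual[end:]


# Hypothetical minimal alignment record; a real one would come from a BAM parser.
Aln = namedtuple("Aln", "name chrom pos strand bc1 bc2")


def dedup_by_sample(alignments):
    """Group alignments by sample barcode (bc2) and keep only the first
    read seen for each (chrom, pos, strand, bc1) within a sample."""
    seen = set()
    per_sample = {}
    for aln in alignments:
        key = (aln.bc2, aln.chrom, aln.pos, aln.strand, aln.bc1)
        if key in seen:
            continue  # same sample, same position, same fragment barcode
        seen.add(key)
        per_sample.setdefault(aln.bc2, []).append(aln)
    return per_sample
```

The barcodes would have to be trimmed from R2 before alignment and carried along (for example in the read name), so that each alignment can be matched back to its barcodes. For paired-end data the key should probably include the position of both mates, which this sketch does not do.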
Thanks
Edit: I just found this paper discussing barcodes (or UMIs): http://genome.cshlp.org/content/early/2017/01/18/gr.209601.116.abstract . A good start.
The Ph. D. thesis of Kasper Karlsson is also a very good read about UMIs.