Hello!
I am trying to call mutations from tumor only human samples (enriched with gene panel) from illumina. One of the steps includes marking duplicates like this:
picard MarkDuplicates INPUT=$ALIGNED_BAM OUTPUT=$MARKDUP_BAM METRICS_FILE=$MARKDUP_METRICS REMOVE_DUPLICATES='false' ASSUME_SORTED='true' DUPLICATE_SCORING_STRATEGY='SUM_OF_BASE_QUALITIES' OPTICAL_DUPLICATE_PIXEL_DISTANCE='100' VALIDATION_STRINGENCY='LENIENT' QUIET=true VERBOSITY=ERROR
But this step returns conspicuously few reads, e. g. at one mutant locus i have:
Before: After:
chr1:155 904 799 chr1:155 904 799
Total count: 8041 Total count: 323
A : 8 (0%, 8+, 0- ) A : 0
C : 7903 (98%, 4353+, 3550- ) C : 259 (80%, 163+, 96- )
G : 0 G : 0
T : 130 (2%, 0+, 130- ) T : 64 (20%, 0+, 64- )
N : 0 N : 0
You may notice that it selectively removes mostly the prevalent (reference) reads.
However in the loci not covered by the gene panel and therefore less represented it acts less eagerly, e. g.:
Before: After:
chr1:155 902 971 chr1:155 902 971
Total count: 48 Total count: 18
A : 38 (79%, 24+, 14- ) A : 14 (78%, 6+, 8- )
C : 0 C : 0
G : 10 (21%, 10+, 0- ) G : 4 (22%, 4+, 0- )
T : 0 T : 0
N : 0 N : 0
Is this the way it is supposed to work? If no, how can i make it leave more of my precious reads? If yes, what could be a problem with my sequencing?
Thanks in advance.
You have
remove duplicates
set tofalse
so no reads should be removed.I already have REMOVE_DUPLICATES='false' . It writes file about as big as the input file, but IGV does not show most of the reads in the resulting file, and other GATK tools seem to ignore them.
Check overrepresented reads in Fastqc report and the number of duplicate reads with
samtools flagstat
. If the layout is paired, check library complexity withpicard EstimateLibraryComplexity
.What does this mean?
picard EstimateLibraryComplexity https://pastebin.com/sCi7iNHh
I found that i passed two first read fastq files instead of first and second reads. :( However with correct input files it stays basically the same.
Can you share the
EstimateLibraryComplexity
for the current alignment and thesamtools flagstat
output after the mark duplicates step is performed?picard EstimateLibraryComplexity https://pastebin.com/KiecjBD7
In IGV make sure you have "filter duplicate reads" option unchecked in Preferences --> Alignments That may simply be the reason why IGV is not showing the reads after dup marking.