I checked my reads with FastQC and the sequence duplication levels looked fine, but during the MarkDuplicates step I lose 55% of my reads as duplicates! What is happening? Can I continue the downstream analysis with the Picard output file after removing the duplicated reads? Is 55% normal?
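For reference, a minimal sketch of the MarkDuplicates step being described, wrapped in Python; the file names and the path to picard.jar are placeholders, not taken from this thread:

```python
import subprocess

# Placeholder paths: adjust to your own files and Picard installation.
input_bam = "sample.sorted.bam"          # coordinate-sorted BAM from the aligner
output_bam = "sample.markdup.bam"        # BAM with duplicates flagged
metrics_file = "sample.dup_metrics.txt"  # Picard writes the duplication metrics here

# Standard Picard MarkDuplicates invocation; REMOVE_DUPLICATES=false keeps the
# duplicate reads in the BAM but sets the 0x400 flag so downstream tools can skip them.
cmd = [
    "java", "-jar", "picard.jar", "MarkDuplicates",
    f"I={input_bam}",
    f"O={output_bam}",
    f"M={metrics_file}",
    "REMOVE_DUPLICATES=false",
]
subprocess.run(cmd, check=True)
```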
What kind of experiment is this? Are you calling SNPs?
Yes, the first step is SNP calling, but I will also use this data to identify signatures of selection and introgression in later steps.
There is not much you can do as long as you did the marking correctly. You may have done too many PCR cycles if the input DNA was at a low concentration.
What is the problem if I want to use this data for the analyses I mentioned earlier (SNP calling, detection of signatures of selection, and introgression)?
You will need to provide more information about your experiment to get a truly useful answer. It's important to know that MarkDuplicates works by simply identifying reads/read pairs with identical mapping coordinates, so if your experiment is amplicon-based or enriches a small target region it will give you a much higher estimate than the likely real number of duplicates (in those circumstances it may not be appropriate to mark duplicates at all). Also, single-end data will produce a higher estimated duplication rate than paired-end data, because there are fewer unique mapping positions.
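As a rough illustration of that coordinate-based logic (a simplified sketch with toy records, not Picard's actual implementation, which also accounts for clipping and picks the best-quality read of each group):

```python
from collections import defaultdict

# Toy records: (read_name, reference, read_5prime_pos, mate_5prime_pos, strand).
# In a real BAM these values would come from the alignment records themselves.
reads = [
    ("r1", "chr1", 1000, 1250, "+"),
    ("r2", "chr1", 1000, 1250, "+"),   # same mapping signature as r1 -> duplicate
    ("r3", "chr1", 1000, 1300, "+"),   # different mate position -> not a duplicate
    ("r4", "chr2", 5000, 5200, "-"),
]

# Group read pairs by their mapping signature; everything after the first
# member of each group is marked as a duplicate.
groups = defaultdict(list)
for name, ref, pos, mate_pos, strand in reads:
    groups[(ref, pos, mate_pos, strand)].append(name)

duplicates = {name for members in groups.values() for name in members[1:]}
print(duplicates)  # {'r2'}
```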
The data are paired-end whole-genome sequencing reads (2 x 150 bp) from an Illumina HiSeq 2500, and I want to do SNP calling, detection of signatures of selection, and introgression analysis. What information do I need to give to get the right answer?
55% is a very high number of duplicates for WGS. There's no hard and fast rule, but I would generally expect closer to 10% for a PCR-based WGS library prep, so this suggests to me that something either went wrong during library prep or something is going wrong as you are marking duplicates.
Is 55% the figure given by the metrics file that Picard produces, or did you calculate it some other way?
The 55% is from the metrics file output by Picard.
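If you want to double-check that figure (and see how much of it is optical rather than PCR duplication), the metrics file is plain tab-delimited text. A small parsing sketch, assuming the standard Picard column names such as PERCENT_DUPLICATION and READ_PAIR_OPTICAL_DUPLICATES:

```python
def read_dup_metrics(path):
    """Return the first metrics row of a Picard MarkDuplicates metrics file as a dict."""
    with open(path) as fh:
        lines = [line.rstrip("\n") for line in fh]
    # Drop Picard's '##' comment lines and blank lines; the first remaining line is
    # the column header and the line after it is the metrics row for the library.
    rows = [line for line in lines if line and not line.startswith("#")]
    header, values = rows[0].split("\t"), rows[1].split("\t")
    return dict(zip(header, values))

metrics = read_dup_metrics("sample.dup_metrics.txt")  # placeholder path
print("Duplication fraction:", metrics["PERCENT_DUPLICATION"])
print("Optical duplicate pairs:", metrics["READ_PAIR_OPTICAL_DUPLICATES"])
```

A high optical-duplicate count points at the sequencing run itself, whereas a high overall rate with few optical duplicates is more consistent with PCR over-amplification of a low-input library.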
Now, after removal of the duplicated reads, what is the problem if I want to use this data for the analyses I mentioned earlier (SNP calling, detection of signatures of selection, and introgression)? Please help me make the right decision about my data. Must I discard it?
You can still call SNPs. Your variant caller should ignore any reads marked as duplicates, so they won't interfere with variant calling, but you should assess your depth of coverage after marking duplicates so you can gauge your sensitivity to detect variants.
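As a quick back-of-the-envelope check of the coverage that remains after deduplication (all numbers below are hypothetical placeholders, not taken from this thread; swap in your own read counts, read length, and genome size):

```python
# Rough post-deduplication coverage estimate.
total_read_pairs = 200_000_000      # read pairs sequenced (hypothetical example)
read_length = 150                   # bp per read (2 x 150 bp paired-end)
duplication_fraction = 0.55         # PERCENT_DUPLICATION from the Picard metrics
genome_size = 2_700_000_000         # bp; replace with your organism's genome size

usable_bases = total_read_pairs * 2 * read_length * (1 - duplication_fraction)
mean_coverage = usable_bases / genome_size
print(f"Approximate mean coverage after removing duplicates: {mean_coverage:.1f}x")
# Rule of thumb (not from this thread): if this falls much below ~10x, sensitivity
# for heterozygous SNP calls starts to suffer noticeably.
```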