I downloaded SRA files from NCBI and converted them to FASTQ. After quality control with FastQC, I found that the duplication level is high (percentage of sequences remaining if deduplicated: 24.45). I want to use these data for variant calling, and my question is: is variant calling a good idea when the duplication level is as high as in my case? Can deduplication after alignment (using Picard) solve my problem?
Are there any programs or scripts to estimate the real duplication rate in FASTQ files?
FastQC also only reports single-end duplication levels; it does not consider the two reads of a pair together.
It's not possible to measure true duplication levels before alignment. At best, a program can test whether the sequences at both ends of a pair are identical to those of another pair. But reads can be duplicates without having identical sequences (e.g. because of sequencing errors). The only real way is to align the reads and then have Picard measure the duplication statistics. You'll want to mark or remove duplicates with Picard MarkDuplicates either way. I say just bite the bullet and align it.
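To make the "at best" case concrete, here is a minimal Python sketch of an exact-match check on read pairs. It is only an illustration, not what FastQC or Picard do: the function names are made up, it assumes plain 4-line FASTQ records with both files in the same order, and it keeps every unique pair in memory. Because it only catches pairs with identical sequences, the number it gives is a rough lower bound on duplication, for the reason described above.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_sequences(handle):
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    # The sequence is the second line of every 4-line record.
    for line in islice(handle, 1, None, 4):
        yield line.strip()


def paired_exact_duplication(r1_handle, r2_handle):
    """Return (total_pairs, unique_pairs, percent_remaining).

    Two pairs count as duplicates only if both mates match exactly,
    so duplicates that differ by a sequencing error are missed.
    """
    seen = set()
    total = 0
    for seq1, seq2 in zip(read_sequences(r1_handle), read_sequences(r2_handle)):
        total += 1
        seen.add((seq1, seq2))
    unique = len(seen)
    percent_remaining = 100.0 * unique / total if total else 0.0
    return total, unique, percent_remaining
```

With hypothetical file names, you would call it as `paired_exact_duplication(open_fastq("reads_1.fastq.gz"), open_fastq("reads_2.fastq.gz"))`. For large runs the set of pairs may not fit in memory, which is one more reason to align and let Picard do the counting.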
That said, if you do want paired-end FASTQ de-duplication, the tally tool will do that for you: http://www.ebi.ac.uk/research/enright/software/kraken.
You'll still have to run MarkDuplicates after aligning, though.
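Once MarkDuplicates has run, the duplication rate is in the metrics file it writes (the `METRICS_FILE` / `M` argument). Below is a small Python sketch for pulling the numbers out of that file. It assumes the usual layout as I understand it: a `## METRICS CLASS` line, then a tab-separated header, then one row per library, ended by a blank line. The function name is made up, and you should check the column names against your own output.

```python
def parse_duplication_metrics(text):
    """Return one dict per library row from a Picard metrics file's text."""
    lines = text.splitlines()
    rows = []
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 1 >= len(lines):
                break  # truncated file: no header line follows
            header = lines[i + 1].split("\t")
            for row_line in lines[i + 2:]:
                if not row_line.strip():
                    break  # a blank line ends the metrics table
                rows.append(dict(zip(header, row_line.split("\t"))))
            break
    return rows
```

For example, with a hypothetical file name, `parse_duplication_metrics(open("dup_metrics.txt").read())` returns the rows, and the `PERCENT_DUPLICATION` column of each row holds the duplication level for that library. Despite its name, that column is reported as a fraction between 0 and 1.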