Question

High duplication level in SRA data

0

8.3 years ago

kk.mahsa ▴ 150

I downloaded SRA files from NCBI and converted them to FASTQ. After quality control with FastQC, I found that the duplication level is high (percentage of sequences remaining if deduplicated: 24.45). I want to use these data for variant calling. Is variant calling a good idea when the duplication level is as high as in my case? Can deduplicating after alignment (using Picard) solve my problem?

SRA SNP alignment • 2.3k views

ADD COMMENT • link updated 8.3 years ago by agata88 ▴ 870 • written 8.3 years ago by kk.mahsa ▴ 150

score 1 · Answer 1 · 2017-05-16

1

8.3 years ago

lakhujanivijay 5.9k

Things to keep in mind when looking at this metric in FastQC (here is the source):

To cut down on the memory requirements for this module only sequences which first appear in the first 100,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file.
Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.
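
The two rules quoted above can be sketched in a few lines of Python. This is only a simplified illustration, not FastQC's actual implementation: the function names and the `tracking_limit` parameter are hypothetical, the FASTQ reader assumes an uncompressed file with four lines per record, and any further corrections FastQC applies are left out.

```python
from collections import Counter


def read_fastq_sequences(path):
    """Yield the sequence line of each record in an uncompressed 4-line FASTQ file."""
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of every record is the sequence
                yield line.strip()


def percent_remaining_if_deduplicated(sequences, tracking_limit=100_000):
    """Approximate the 'percent of seqs remaining if deduplicated' figure.

    Only sequences that first appear within the first `tracking_limit` reads
    are tracked, and reads longer than 75 bp are truncated to 50 bp before
    the exact-match comparison.
    """
    counts = Counter()
    for index, seq in enumerate(sequences):
        key = seq[:50] if len(seq) > 75 else seq
        if key in counts:
            counts[key] += 1      # another copy of an already tracked sequence
        elif index < tracking_limit:
            counts[key] = 1       # new sequence, still inside the tracking window
        # new sequences seen after the window are ignored entirely
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * len(counts) / total
```

For example, `percent_remaining_if_deduplicated(read_fastq_sequences("reads.fastq"))` would give a rough figure for a hypothetical file `reads.fastq`. A result of 24.45 means that about three quarters of the tracked reads are copies of another read.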

Consider reading these posts here and here.

ADD COMMENT • link 8.3 years ago by lakhujanivijay 5.9k

0

Are there any programs or scripts to estimate the real duplication rate in FASTQ files?

ADD REPLY • link 8.3 years ago by kk.mahsa ▴ 150

1

FastQC also reports only single-end duplication levels.

It's not possible to measure duplication levels pre-alignment. At best, a program might test for identity of the sequence at both ends of a pair. But reads can be duplicates without having identical sequences (e.g. because of sequencing errors). The only real way is to align and then get Picard to measure the duplication statistics. You'll want to remove duplicates with Picard MarkDuplicates either way. I say just bite the bullet and align it.

That said, if you do want to do paired-end FASTQ de-duplication, the tally tool will do that for you: http://www.ebi.ac.uk/research/enright/software/kraken.

You'll still have to run MarkDuplicates after aligning though.
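
To illustrate why exact-match de-duplication on FASTQ is incomplete, here is a minimal Python sketch. It is not how tally works internally; `dedupe_pairs_exact` and the toy read pairs are made up for the illustration. It keeps the first occurrence of each (read 1, read 2) sequence pair and shows that a copy carrying a single sequencing error survives.

```python
def dedupe_pairs_exact(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair."""
    seen = set()
    kept = []
    for read1, read2 in pairs:
        key = (read1, read2)
        if key not in seen:
            seen.add(key)
            kept.append((read1, read2))
    return kept


# Hypothetical toy data: three read pairs from the same original fragment.
pairs = [
    ("ACGTACGT", "TTGACCAA"),
    ("ACGTACGT", "TTGACCAA"),  # exact copy: removed
    ("ACGTACGA", "TTGACCAA"),  # same fragment, one base error in read 1: kept
]

kept = dedupe_pairs_exact(pairs)
print(len(pairs), "pairs in,", len(kept), "pairs out")
```

The third pair is still a duplicate of the first, but only a position-based check after alignment can recognise it as one.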

ADD REPLY • link 8.3 years ago by i.sudbery 22k

score 0 · Answer 2 · 2017-05-16

0

8.3 years ago

agata88 ▴ 870

I would suggest removing them before variant calling. The percentage is high and may cause problems when distinguishing homozygous from heterozygous variants (shifted frequency of the alt allele).

Best, Agata
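
A small worked example of the shifted alt allele frequency mentioned above, as a Python sketch with made-up numbers: a truly heterozygous site is covered by 10 unique fragments (5 ref, 5 alt), and one alt fragment happens to be amplified into 11 copies. The function name and the data are hypothetical.

```python
def alt_allele_fraction(fragments, collapse_duplicates):
    """Return the alt allele fraction at one site.

    `fragments` is a list of (allele, copies) tuples, where `copies` is the
    number of reads derived from the same original fragment.
    """
    ref = alt = 0
    for allele, copies in fragments:
        weight = 1 if collapse_duplicates else copies
        if allele == "alt":
            alt += weight
        else:
            ref += weight
    return alt / (ref + alt)


# 5 ref fragments, 4 alt fragments, and 1 alt fragment amplified to 11 copies.
fragments = [("ref", 1)] * 5 + [("alt", 1)] * 4 + [("alt", 11)]

raw_fraction = alt_allele_fraction(fragments, collapse_duplicates=False)
dedup_fraction = alt_allele_fraction(fragments, collapse_duplicates=True)
print(raw_fraction, dedup_fraction)
```

With duplicates counted, the alt fraction is 15/20 = 0.75, which starts to look less like a heterozygous call. With duplicates collapsed, it returns to 5/10 = 0.5.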

ADD COMMENT • link 8.3 years ago by agata88 ▴ 870

0

I am going to remove duplicates after alignment (using Picard), but I am not sure whether that is the correct approach.
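
For reference, a Picard MarkDuplicates call for this step might look like the sketch below, which only builds and prints the command line. The file names are placeholders, the BAM is assumed to be coordinate-sorted already, and a `picard` wrapper script is assumed to be on the PATH (with a plain jar it would be `java -jar picard.jar` instead).

```python
import shlex


def mark_duplicates_command(input_bam, output_bam, metrics_file, remove=False):
    """Build a Picard MarkDuplicates command line as a list of arguments.

    By default duplicates are only flagged in the output BAM; with
    remove=True they are dropped from the output instead.
    """
    command = [
        "picard", "MarkDuplicates",   # assumes a `picard` wrapper on the PATH
        f"I={input_bam}",             # coordinate-sorted alignments
        f"O={output_bam}",            # output BAM
        f"M={metrics_file}",          # duplication metrics report
    ]
    if remove:
        command.append("REMOVE_DUPLICATES=true")
    return command


# Placeholder file names, for illustration only.
command = mark_duplicates_command("aligned.sorted.bam", "dedup.bam", "dup_metrics.txt")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)`. The metrics file is where Picard reports the duplication rate measured from the alignments.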

ADD REPLY • link 8.3 years ago by kk.mahsa ▴ 150