Question

Regarding PCR duplicates vs duplicate entries in a bedfile


9.5 years ago

eudoraleer ▴ 10

Hi All,

I am confused about the term "duplicates" as used in NGS. How is it defined? Is a PCR duplicate a single read that has been mapped to more than one region?

I intersected two BED files and the output contains duplicate entries. Are those called duplicates / PCR duplicates?

The following are my BED file entries and the output:

Bedfile A:

chr1   98164856 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/1   60      -
chr1   98164857 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/2   60      +

Bedfile B:

chr1    98164703        98164863
chr1    98164864        98165170

Command:

bedtools intersect -a Bedfile_A -b Bedfile_B -wa -wb

Output:

chr1   98164856 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/1   60      -       chr1    98164703        98164863
chr1   98164856 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/1   60      -       chr1    98164864        98165170
chr1   98164857 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/2   60      +       chr1    98164703        98164863
chr1   98164857 98164948        M01882:88:000000000-ABG4T:1:2119:15091:9077/2   60      +       chr1    98164864        98165170

Please help me to address my concerns, thanks!

duplicates pcr bed • 2.7k views

updated 2.7 years ago by Ram 45k • written 9.5 years ago by eudoraleer ▴ 10

Ram · Answer 1 · 2015-11-20


9.5 years ago

Devon Ryan 105k

The definition of "duplicate" in NGS is no different from standard English. Your entries from BED file A are repeated in the intersect because they overlap multiple entries in BED file B. You may be interested in the -u option.
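To make the repetition concrete, here is a minimal Python sketch of the overlap logic, assuming standard BED coordinates (0-based, half-open). It is not bedtools' implementation; the function names are hypothetical and only the coordinates come from the question. With bedtools itself, the equivalent of the second function would be `bedtools intersect -a Bedfile_A -b Bedfile_B -u`.

```python
from typing import List, Tuple

# (chrom, start, end) using BED's 0-based, half-open convention
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share a base."""
    return a[0] == b[0] and max(a[1], b[1]) < min(a[2], b[2])


def intersect_wa_wb(a_rows: List[Interval], b_rows: List[Interval]):
    """One output row per overlapping (A, B) pair, similar to -wa -wb."""
    return [(a, b) for a in a_rows for b in b_rows if overlaps(a, b)]


def intersect_u(a_rows: List[Interval], b_rows: List[Interval]):
    """Each A entry at most once if it overlaps anything in B, similar to -u."""
    return [a for a in a_rows if any(overlaps(a, b) for b in b_rows)]


# Coordinates taken from the question
bed_a = [("chr1", 98164856, 98164948), ("chr1", 98164857, 98164948)]
bed_b = [("chr1", 98164703, 98164863), ("chr1", 98164864, 98165170)]

pairs = intersect_wa_wb(bed_a, bed_b)
unique = intersect_u(bed_a, bed_b)

print(len(pairs))   # 4: each A entry overlaps both B entries
print(len(unique))  # 2: each A entry reported once
```

Each read in A spans the boundary between the two B intervals, so it is reported once per B interval it touches. Nothing about the reads themselves is duplicated.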

PCR duplicates are created via PCR. You're not doing PCR, you're using a computer, so those can't be PCR duplicates.

9.5 years ago by Devon Ryan 105k


What about BAM files? Are there any PCR duplicates in them? Also, what exactly are PCR duplicates? I have no idea, and the websites I found do not give a clear definition. Thanks!

updated 5.4 years ago by Ram 45k • written 9.5 years ago by eudoraleer ▴ 10


PCR duplicates are not specific to a file format because they do not arise from a computational problem. Asking whether PCR duplicates are present in a BAM file or a BED file is like asking whether fuel contamination happens in petrol or diesel. They're an experimental artifact, due to a technique named, wait for it, PCR. You can read about PCR in many places (including Wikipedia, which does a good job of explaining the basics), but the major take-home message is that it is used to create exact copies of the same DNA.

Now, PCR duplicates in an NGS experiment occur because NGS libraries are PCR amplified before they are loaded onto the sequencing machines. So, in the output, if there are multiple copies of the same DNA fragment, you will have several reads that map to exactly the same genomic locations. For starters, this typically means several BED entries (or BAM entries) at the same chromosome and start (and/or end position depending on the read length, which in turn depends on the sequencing platform).
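As a rough illustration of "several entries at the same position", the sketch below groups BED6 rows by chromosome, start, end and strand and reports groups with more than one read. This is a simplification under stated assumptions: the read names and coordinates are made up, the function name is hypothetical, and dedicated duplicate-marking tools work on alignments and apply more careful criteria than this.

```python
from collections import defaultdict


def candidate_duplicates(bed_rows):
    """Group BED6 rows by (chrom, start, end, strand).

    Returns only the groups that contain more than one read name.
    """
    groups = defaultdict(list)
    for chrom, start, end, name, score, strand in bed_rows:
        groups[(chrom, start, end, strand)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical reads: readA and readB share coordinates and strand.
rows = [
    ("chr1", 100, 200, "readA", 60, "+"),
    ("chr1", 100, 200, "readB", 60, "+"),
    ("chr1", 100, 200, "readC", 60, "-"),  # same span, opposite strand
    ("chr1", 150, 250, "readD", 60, "+"),  # different span
]

dups = candidate_duplicates(rows)
print(dups)
```

Reads sharing a position are only candidates: identical fragments can also arise by chance, especially at high coverage.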

Hope this helps.

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Tej Sowpati ▴ 250


Thank you crackerjack! However, what I do not understand is this: during PCR amplification, aren't we supposed to amplify the same set of fragments? If so, there will definitely be multiple copies of the same reads after PCR amplification. Are those amplified fragments the duplicates? If so, what is the use of amplification if we do not want duplicate reads? I am not sure why I do not understand; maybe I am missing bits and pieces of information about NGS.

9.5 years ago by eudoraleer ▴ 10


Yes, the amplification process is expected to produce duplicates. The reason this is done is that many steps in molecular biology (e.g., ligation) are much more efficient with higher concentrations of DNA/RNA/etc. This is why the PCR is done. Having said that, we don't normally want to pay attention to these duplicates at the analysis stage (at least if you're doing variant calling), so they can be marked accordingly by samtools or Picard tools.
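Once duplicates are marked, the mark lives in the SAM/BAM FLAG field: bit 0x400 (decimal 1024) means "PCR or optical duplicate". Here is a minimal sketch of checking that bit on plain-text SAM lines; the two records are hypothetical and the helper name is made up for illustration.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit for "PCR or optical duplicate"


def is_marked_duplicate(sam_line: str) -> bool:
    """Check the duplicate bit in the FLAG column (second field) of a SAM record."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE_FLAG)


# Hypothetical SAM records; readB carries flag 1123 = 99 + 1024.
records = [
    "readA\t99\tchr1\t101\t60\t50M\t=\t201\t150\t*\t*",
    "readB\t1123\tchr1\t101\t60\t50M\t=\t201\t150\t*\t*",
]

kept = [r.split("\t")[0] for r in records if not is_marked_duplicate(r)]
print(kept)  # ['readA']
```

Marking rather than deleting keeps the reads in the file, so downstream tools can decide whether to ignore them.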

9.5 years ago by Devon Ryan 105k

Ram · Answer 2 · 2015-11-23


9.5 years ago

Antonio R. Franco ★ 5.2k

You are supposed to break the DNA or cDNA into random shotgun fragments, so the chance of sequencing identical fragments should be relatively low.

Since PCR amplification can be biased, you can end up with a population of duplicate sequences at a higher frequency than expected.

updated 5.4 years ago by Ram 45k • written 9.5 years ago by Antonio R. Franco ★ 5.2k


Thank you very much! It is very informative.

9.5 years ago by eudoraleer ▴ 10