Question

What to do when the two reads of a pair are almost the same, e.g. they overlap each other by 90-100%?

0

9.5 years ago

crivenster ▴ 50

I have HLA-based NGS data from a MiSeq. How should I deal with paired-end (PE) data in which read 1 and read 2 of a pair overlap by more than 90%, or even contain exactly the same sequence? I am working on a pre-processing script that goes with an existing pipeline.

pre-processing NGS next-gen hla • 4.7k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 9.5 years ago by crivenster ▴ 50

0

Why do you want to do something with them? Can't you treat them as normal PE data?

ADD REPLY • link 9.5 years ago by GouthamAtla 12k

0

But won't they affect the coverage calculation? Some regions will appear to have more reads (because the two reads of each pair are the same), while other regions will appear to have fewer (because the read pairs there overlap very little or not at all).

ADD REPLY • link 9.5 years ago by crivenster ▴ 50

1

No, it doesn't matter. The insert size should generally be independent of the genome, but regardless, the coverage is not really affected by the insert size (other than +-1).

The most important thing to do in this case is to adapter-trim reads, as inserts shorter than read length will have adapter sequence that will cause poor mapping.
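To make the read-through point concrete, here is a minimal Python sketch. It is not how any real trimmer works: the insert sequence is made up, matching is exact only, and the function names `sequence_read` and `trim_adapter` are hypothetical. It only shows that a read longer than its insert ends in adapter bases, and that removing them recovers the insert.

```python
def sequence_read(insert: str, adapter: str, read_len: int) -> str:
    """Simulate one read: the sequencer reads the insert, then runs into the adapter."""
    template = insert + adapter
    return template[:read_len]


def trim_adapter(read: str, adapter: str, min_match: int = 3) -> str:
    """Remove a (possibly partial) adapter from the 3' end of a read.

    Exact matching only; this is an illustrative assumption, real trimmers
    tolerate mismatches and can also use the overlap between mates.
    """
    for start in range(len(read)):
        tail = read[start:]
        if len(tail) < min_match:
            break
        # The tail is adapter if it is a prefix of the adapter (partial adapter)
        # or if it begins with the complete adapter.
        if adapter.startswith(tail) or tail.startswith(adapter):
            return read[:start]
    return read


insert = "ACGTTGCAAGGCTA"   # hypothetical 14 bp insert, shorter than the read
adapter = "AGATCGGAAGAGC"   # start of a commonly used Illumina adapter
read = sequence_read(insert, adapter, read_len=20)

print(read)                         # insert followed by 6 adapter bases
print(trim_adapter(read, adapter))  # the insert alone
```

Without trimming, the trailing adapter bases would have to be aligned against the reference, which is what causes the poor mapping mentioned above.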

ADD REPLY • link updated 21 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

2

If read pairs overlap, some tools might double-count coverage in the overlapping portion, which is incorrect because you are just sequencing the same fragment twice (I'm not sure if this is what crivenster meant, though).

ADD REPLY • link 9.5 years ago by dariober 15k

2

I have no idea what I was thinking when I said +-1, you're correct, the difference can be a factor of 2. The point I was trying to make was that this will be evenly distributed everywhere so it shouldn't really affect a coverage analysis much. Once you have sufficient coverage at some location, the insert size will not matter very much.

BBMap has a "physcov" flag that will allow calculation of physical coverage, meaning that with large inserts the unsequenced bases in the middle will be counted, and with short inserts the double-covered bases will only be counted once, rather than twice. But I think if this analysis was done with physical coverage enabled versus disabled the conclusion would be the same.
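The difference between the two ways of counting can be sketched in a few lines of Python. This is not BBMap's implementation; the function names and the toy intervals are assumptions made for illustration, with each mate given as a half-open `(start, end)` interval on the reference.

```python
from typing import List, Tuple

Interval = Tuple[int, int]          # half-open [start, end) on the reference
Pair = Tuple[Interval, Interval]    # the two mates of one fragment


def read_coverage(pairs: List[Pair], length: int) -> List[int]:
    """Per-base depth where every read counts, so overlapping mates count twice."""
    depth = [0] * length
    for mate1, mate2 in pairs:
        for start, end in (mate1, mate2):
            for pos in range(start, end):
                depth[pos] += 1
    return depth


def physical_coverage(pairs: List[Pair], length: int) -> List[int]:
    """Per-base depth where each fragment counts once, from its first to its last base."""
    depth = [0] * length
    for (s1, e1), (s2, e2) in pairs:
        for pos in range(min(s1, s2), max(e1, e2)):
            depth[pos] += 1
    return depth


pairs = [
    ((0, 6), (0, 6)),    # short fragment: mates overlap 100%
    ((2, 6), (8, 12)),   # long fragment: unsequenced gap in the middle
]
print(read_coverage(pairs, 12))      # [2, 2, 3, 3, 3, 3, 0, 0, 1, 1, 1, 1]
print(physical_coverage(pairs, 12))  # [1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
```

In the toy example the fully overlapping pair contributes depth 2 under read-based counting but depth 1 under physical counting, and the gap at positions 6-7 is only filled in by the physical count.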

ADD REPLY • link updated 21 months ago by Ram 44k • written 9.5 years ago by Brian Bushnell 20k

Ram · Answer 1 · 2015-05-12

1

9.5 years ago

Alvaro Sebastian ▴ 70

Merge them; the merging program will take the nucleotide with the better quality at each position: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html
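As a rough illustration of what "take the nucleotide with the better quality" means, here is a simplified Python sketch. It is not the algorithm of any specific merging tool: `merge_pair`, its thresholds, and the toy sequences are assumptions, qualities are plain lists of integers, and it only handles fragments at least as long as a read (no adapter read-through).

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(r1, q1, r2, q2, min_overlap=5, max_mismatches=1):
    """Merge two mates into one sequence, or return None if no overlap is found.

    r2/q2 are given as sequenced (reverse strand). The longest overlap with
    at most max_mismatches mismatches is accepted (a simplifying assumption).
    """
    r2rc = reverse_complement(r2)
    q2rc = q2[::-1]
    for ov in range(min(len(r1), len(r2rc)), min_overlap - 1, -1):
        offset = len(r1) - ov
        mismatches = sum(a != b for a, b in zip(r1[offset:], r2rc[:ov]))
        if mismatches > max_mismatches:
            continue
        merged = list(r1[:offset])               # part covered by read 1 only
        for i in range(ov):                      # overlap: keep the better base
            if q1[offset + i] >= q2rc[i]:
                merged.append(r1[offset + i])
            else:
                merged.append(r2rc[i])
        merged.extend(r2rc[ov:])                 # part covered by read 2 only
        return "".join(merged)
    return None


# Hypothetical 12 bp fragment ACGTTGCAAGGC sequenced with 10 bp reads.
r1 = "ACGTTGCTAG"                  # sequencing error at index 7 (A read as T)
q1 = [30] * 10
q1[7] = 5                          # the erroneous base has low quality
r2 = "GCCTTGCAAC"                  # first 10 bases of the reverse strand
q2 = [30] * 10

merged = merge_pair(r1, q1, r2, q2)
print(merged)                      # ACGTTGCAAGGC
```

In the example the low-quality error in read 1 is overruled by the high-quality base from read 2, so the merged sequence matches the original fragment.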

Is this amplicon sequencing data (sequencing of PCR products)? I think this problem is less likely to occur with DNA fragmentation.

ADD COMMENT • link updated 21 months ago by Ram 44k • written 9.5 years ago by Alvaro Sebastian ▴ 70

0

There have been better tools developed for merging reads via overlap since 2012, but merging is not really necessary in this case.

ADD REPLY • link 9.5 years ago by Brian Bushnell 20k

Ram · Answer 2 · 2015-05-12

1

9.5 years ago

dariober 15k

After mapping PE reads, I usually soft clip the overlapping part of one of the two reads. There is a nice program for this: clipOverlap. I think it is better to clip after mapping than to merge reads as Alvaro suggests. Also be aware that if read pairs overlap by 100%, some aligners might not mark them as "mapped in proper pair", even though I think they are proper pairs.
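The idea of clipping after mapping can be sketched on plain intervals. This is not clipOverlap's actual logic (which operates on BAM records and has its own rule for choosing which mate to clip); the function `clip_overlap` and the convention of always clipping the later-starting mate are assumptions for illustration.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end) of the aligned bases


def clip_overlap(mate_a: Interval, mate_b: Interval) -> Tuple[Interval, Interval, int]:
    """Remove the shared region from the later-starting mate.

    Returns (kept mate, clipped mate, number of bases clipped). A mate that is
    fully contained in the other ends up as an empty interval.
    """
    (s1, e1), (s2, e2) = sorted([mate_a, mate_b])
    overlap = max(0, min(e1, e2) - s2)
    return (s1, e1), (s2 + overlap, e2), overlap


print(clip_overlap((100, 250), (200, 350)))  # ((100, 250), (250, 350), 50)
print(clip_overlap((100, 250), (100, 250)))  # ((100, 250), (250, 250), 150)
print(clip_overlap((100, 200), (300, 400)))  # ((100, 200), (300, 400), 0)
```

After clipping, every reference base is supported at most once per fragment, whether the mates overlapped by one base or by 100%.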

ADD COMMENT • link updated 21 months ago by Ram 44k • written 9.5 years ago by dariober 15k

0

So when reads overlap 100%, is it better to perform read merging, or soft clipping as you mentioned, before alignment? That way I could avoid aligning the same region twice and maybe get better mapping results. The HLA data I use has fragment lengths between 200 and 500 bp, as some of the sequenced regions are only 250 bp long. Overlap is therefore certain, and 100% overlap has been common in the data generated here.
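For the fragment lengths mentioned, the expected overlap follows from simple arithmetic. The Python sketch below assumes a read length of 250 bp, which is not stated in the thread, and the helper `overlap_bases` is hypothetical.

```python
def overlap_bases(fragment_len: int, read_len: int) -> int:
    """Number of fragment bases that are covered by both mates."""
    covered_by_each = min(read_len, fragment_len)  # a read cannot cover more than the fragment
    return max(0, 2 * covered_by_each - fragment_len)


READ_LEN = 250  # assumed read length, not given in the thread

for frag in (200, 250, 300, 400, 500):
    ov = overlap_bases(frag, READ_LEN)
    share = ov / min(READ_LEN, frag)   # fraction of each read's fragment bases that overlap
    print(f"fragment {frag} bp: overlap {ov} bp ({share:.0%} of each read)")
```

Under that assumption, any fragment of 250 bp or less gives 100% overlap, and the overlap only disappears once the fragment reaches twice the read length.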

ADD REPLY • link 9.5 years ago by crivenster ▴ 50

1

As I said, I prefer to soft clip after mapping rather than merging. I don't take into consideration how much overlap there is between the reads of a pair (100% or just 1 base); I just clip whatever is overlapping.

ADD REPLY • link 9.5 years ago by dariober 15k