I have HLA-based NGS data from a MiSeq. How should I deal with overlap in the data when read one and read two of a read pair (PE) overlap by more than 90%, or even contain exactly the same sequence? I am working on a pre-processing script that fits into an existing pipeline.
Why do you want to do something with them? Can't you treat them as normal PE data?
But won't they affect the coverage calculation, since some regions will have more reads (because the two reads of a pair are identical) and other regions will have fewer reads (because the read pairs in that region have little or no overlap)?
No, it doesn't matter. The insert size should generally be independent of the genome, but regardless, the coverage is not really affected by the insert size (other than +-1).
The most important thing to do in this case is to adapter-trim the reads, as inserts shorter than the read length will contain adapter sequence, which causes poor mapping.
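To put rough numbers on that, here is a minimal sketch of the arithmetic, assuming both reads have the same length and ignoring quality trimming and indels. The function names are made up for illustration and are not taken from any trimming tool.

```python
def adapter_bases_per_read(read_len: int, insert_size: int) -> int:
    """Bases of adapter read-through at the 3' end of each read.

    Non-zero only when the insert is shorter than the read length.
    """
    return max(0, read_len - insert_size)


def overlap_after_trimming(read_len: int, insert_size: int) -> int:
    """Bases of the insert covered by both reads once adapters are trimmed.

    Assumption: both mates have length read_len and start at opposite
    ends of the insert.
    """
    return min(insert_size, max(0, 2 * read_len - insert_size))


# Hypothetical 2x250 run with a few example insert sizes
for insert in (200, 400, 600):
    print(insert, adapter_bases_per_read(250, insert), overlap_after_trimming(250, insert))
```

With a 200 bp insert, each 250 bp read carries 50 bases of adapter, and after trimming the two reads cover exactly the same 200 bases.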
If read pairs overlap, some tools might double-count coverage in the overlapping portion, which is incorrect because you are just sequencing the same fragment twice (I'm not sure if this is what crivenster meant, though).
I have no idea what I was thinking when I said +-1. You're correct, the difference can be a factor of 2. The point I was trying to make was that this will be evenly distributed everywhere, so it shouldn't really affect a coverage analysis much. Once you have sufficient coverage at some location, the insert size will not matter very much.
BBMap has a "physcov" flag that will allow calculation of physical coverage, meaning that with large inserts the unsequenced bases in the middle will be counted, and with short inserts the double-covered bases will only be counted once, rather than twice. But I think if this analysis was done with physical coverage enabled versus disabled the conclusion would be the same.
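For anyone who wants to see the difference between the two ways of counting, here is a small sketch in plain Python. It is not BBMap's code, only an illustration of the idea; the function name and the tuple layout for the mapped pairs are assumptions made up for this example.

```python
def pair_coverage(ref_len, pairs, physical=False):
    """Per-base coverage from mapped read pairs.

    pairs: list of (r1_start, r1_end, r2_start, r2_end) in 0-based,
    half-open reference coordinates (a made-up layout for illustration).
    physical=False counts each read separately, so overlapping bases
    are counted twice. physical=True counts the whole fragment once,
    including any unsequenced bases between the two reads.
    """
    cov = [0] * ref_len
    for r1s, r1e, r2s, r2e in pairs:
        if physical:
            spans = [(min(r1s, r2s), max(r1e, r2e))]
        else:
            spans = [(r1s, r1e), (r2s, r2e)]
        for start, end in spans:
            for pos in range(max(start, 0), min(end, ref_len)):
                cov[pos] += 1
    return cov


overlapping = [(0, 6, 4, 10)]  # reads overlap at positions 4 and 5
gapped = [(0, 3, 7, 10)]       # positions 3 to 6 are not sequenced

print(pair_coverage(10, overlapping))                 # overlap counted twice
print(pair_coverage(10, overlapping, physical=True))  # overlap counted once
print(pair_coverage(10, gapped))                      # gap has zero coverage
print(pair_coverage(10, gapped, physical=True))       # gap counted as covered
```

In the overlapping case the per-read count gives 2 at positions 4 and 5, while the physical count gives 1 everywhere across the fragment.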