Hello!
I am analyzing Illumina sequencing data and see an enormous number (up to 50%) of hybrid reads, i.e. reads that align (with BWA) to two different positions in the genomes. I suspect this is caused by improper library preparation, specifically a skipped or failed dephosphorylation step (so fragments stuck together). My question is: should I take any specific additional steps during data analysis when handling such data? Removing all of these reads would dramatically reduce coverage, and I realize that although the reads are hybrid, their parts are derived from specific places in the genomes and can still be used to obtain information about them. Still, I am worried that keeping hybrid reads may introduce errors. Any thoughts?
You mean part of the read (say, the first 50 nt) aligns to one part of the genome and another part of the read (say, the last 50 nt) aligns to another part of the genome?
Are these paired-end reads or single-end reads?
Yes, reads are paired.
Yes, you got it right
I have not seen this happen before. Can you share a few lines of SAM output, just to be sure we are all talking about the same thing?
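In case it helps with pulling out those lines, here is a rough sketch for picking candidate split-alignment records out of a SAM file. It assumes the alignments came from BWA-MEM, which (as far as I know) reports the extra piece of a split read as a supplementary record (flag bit 0x800) and adds an `SA:Z:` tag to the records involved; other aligners or BWA modes may mark these differently. The function names `is_split_record` and `split_records` are made up for this example.

```python
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def is_split_record(sam_line):
    """Return True if a SAM record looks like part of a split (chimeric) alignment.

    Assumption: split reads carry flag 0x800 and/or an SA:Z: optional tag,
    as in BWA-MEM output.
    """
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return False  # not a complete alignment record
    flag = int(fields[1])
    # Optional tags start at column 12 (index 11)
    has_sa_tag = any(f.startswith("SA:Z:") for f in fields[11:])
    return bool(flag & SUPPLEMENTARY) or has_sa_tag


def split_records(lines, limit=10):
    """Collect up to `limit` split-alignment records, skipping header lines."""
    found = []
    for line in lines:
        if line.startswith("@"):
            continue  # SAM header line
        if is_split_record(line):
            found.append(line.rstrip("\n"))
            if len(found) >= limit:
                break
    return found
```

To use it, open the SAM file and pass the file handle to `split_records`, e.g. `with open("aln.sam") as fh: print("\n".join(split_records(fh)))`, where `aln.sam` is a placeholder file name. Looking at the CIGAR strings (soft or hard clipping) and the `SA` tag of the resulting lines should show whether the two halves of a read really map to different places.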