Hi All,
Again on the issue of marking duplicates...
If you have single end reads tools like Picard MarkDuplicates mark as duplicates reads sharing the same 5'end coordinates and ignore the 3' coordinate (see here). Is this the right thing to do?
I can see the point in looking at only 5' coordinates when the read length was quite a bit shorter then the fragment length (say 35-70 bp) and sequence quality dropped after ~50bp.
But now you can get reads up to 300bp, which is well into or above the range of the fragment length. So reads sharing the same 5'end but different 3'end might well be genuine independent fragments and MarkDuplicates would get them wrong. Am I missing something? Does anybody know of a tool that looks at both 5' and 3' ends for SE reads?
I think the 3' end isn't taken into consideration in the Picard algorithm because read quality decays variably at the 3' end of a read. Aligners such as BWA usually soft clip these low-quality bases. Thus two reads which truly were PCR duplicates might not be marked as such if one had a different soft-clipping pattern. If you took into account the 3' end, you would introduce a higher proportion of false negatives as a result.
Hi- I see your point, but that's what I'm questioning. Nowadays read quality stays high for up to 300bp (MiSeq, I've seen it!) so I think ignoring 3'ends has become overly conservative, especially if your fragment size is quite small ~100-200bp.
Last 100 bases would be adapter sequence which would be trimmed before further processing (i.e. before alignment).