Hi~ I'm working on some WGBS data now.
After quality and adapter trimming, Sequence Duplication Levels and Per sequence GC content still do not pass in FastQC. In Per sequence GC content, the peak of the read distribution is higher than the theoretical distribution. Is this OK?
Any help would be much appreciated!
Here are some pictures from FastQC after trimming.
There is a small fluctuation in the first few bases. Should I trim them? At the end, the sharp decrease of A at the last position is a result of removing the adapter sequence very stringently, i.e. even a single trailing A at the end is removed.
Should I deduplicate the reads during quality control (before mapping), or filter them after alignment using deduplicate_bismark?
Thank you very much! Do you mean duplicates can be kept before mapping?
Correct, there's no reason to bother deduplicating before alignment.
Yes. Usually you identify duplicates if two reads (or read pairs) align to the same exact spot in the genome.
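To make that concrete, here is a minimal Python sketch of position-based duplicate marking for single-end reads. The tuple layout (name, chromosome, strand, start) and the function name mark_duplicates are made up for illustration. This is not Bismark's code, and treating opposite strands as distinct positions is an assumption of the sketch. Real tools work on SAM/BAM records rather than plain tuples.

```python
def mark_duplicates(alignments):
    """Split alignments into first-seen reads and positional duplicates.

    Each alignment is a hypothetical (name, chrom, strand, start) tuple.
    """
    seen = set()
    kept, duplicates = [], []
    for name, chrom, strand, start in alignments:
        # Assumption: reads on opposite strands are never duplicates of each other.
        key = (chrom, strand, start)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
            kept.append(name)
    return kept, duplicates


alignments = [
    ("read1", "chr1", "+", 1000),
    ("read2", "chr1", "+", 1000),  # same spot and strand as read1
    ("read3", "chr1", "-", 1000),  # same spot, opposite strand
    ("read4", "chr2", "+", 1000),  # same coordinate, different chromosome
]
kept, duplicates = mark_duplicates(alignments)
print("kept:", kept)
print("duplicates:", duplicates)
```

The key only exists once a read has coordinates, which is why this kind of filtering happens after alignment rather than on the raw FASTQ.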
For paired-end alignments, does Bismark consider a pair a duplicate only if both partner reads start and end at exactly the same position, or is it enough for one of the partner reads to match?
Oh, I figured it out. A pair is a duplicate when both partner reads start and end at exactly the same position. Thank you very much.
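For anyone finding this later, the paired-end rule can be sketched in Python as grouping pairs by both ends of the sequenced fragment. The tuple layout (name, chromosome, strand, fragment start, fragment end) and the function name group_pairs_by_fragment are hypothetical, and which coordinates deduplicate_bismark actually compares internally is an assumption here, so check the Bismark documentation before relying on it.

```python
from collections import defaultdict


def group_pairs_by_fragment(pairs):
    """Group read pairs that share chromosome, strand and both fragment ends.

    Each pair is a hypothetical (name, chrom, strand, frag_start, frag_end) tuple.
    """
    groups = defaultdict(list)
    for name, chrom, strand, frag_start, frag_end in pairs:
        groups[(chrom, strand, frag_start, frag_end)].append(name)
    return dict(groups)


pairs = [
    ("pairA", "chr1", "+", 500, 750),
    ("pairB", "chr1", "+", 500, 750),  # both ends match pairA
    ("pairC", "chr1", "+", 500, 800),  # same start, different end
    ("pairD", "chr1", "+", 450, 750),  # same end, different start
]
groups = group_pairs_by_fragment(pairs)
# Any group with more than one pair is a set of positional duplicates.
duplicate_sets = [names for names in groups.values() if len(names) > 1]
print(duplicate_sets)
```

Pairs that share only one end (pairC and pairD in the toy data) land in their own groups, so they are not treated as duplicates of pairA.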