I am looking at the reverse reads (R2) of a 2x150 paired-end Illumina transcriptome dataset. As the plot below shows (mean Phred score per base position), there is a sudden dip at bases 5, 6 and 7. I am wondering:
What is the most likely explanation for this dip: a problem with library preparation or a technical problem with the sequencer? Another observation is that a large share of the datasets from the same sequencing batch are affected by this issue.
To get rid of this dip, I ran Trimmomatic with "HEADCROP" to remove the first 7-8 bases, which, unsurprisingly, improved the quality distribution considerably. However, it also changed the "Sequence Duplication Levels" metric: the "Percent of sequences remaining after deduplication" dropped from 71.7% to 32.8%, as shown here:
Before trimming
After trimming
What could be the explanation? I also went through this Biostars post, which helped only a little.
Is this across all the lanes and tiles? If so, it's a machine error (a focusing issue or similar). If not, it's probably a bubble (or a series of them).
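One way to check the lane/tile question is to group the quality at the affected positions by lane and tile. The following is only a minimal sketch: it assumes standard four-line FASTQ records, Phred+33 encoding, and Illumina (CASAVA 1.8+) read names of the form `@instrument:run:flowcell:lane:tile:x:y`. The function name `per_tile_mean_quality` and its parameters are made up for illustration.

```python
import gzip
from collections import defaultdict


def per_tile_mean_quality(fastq_path, positions=(4, 5, 6), phred_offset=33):
    """Mean Phred score at the given 0-based positions, grouped by (lane, tile).

    Assumes 4-line FASTQ records and Illumina-style read names
    (@instrument:run:flowcell:lane:tile:x:y ...); both are assumptions.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break  # end of file
            handle.readline()  # sequence line (not needed here)
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip()
            fields = header.split(" ")[0].split(":")
            if len(fields) < 7:
                continue  # read name is not in the assumed Illumina format
            key = (fields[3], fields[4])  # (lane, tile)
            for pos in positions:
                if pos < len(qual):
                    sums[key] += ord(qual[pos]) - phred_offset
                    counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}
```

If the mean at bases 5-7 is low for every (lane, tile) pair, that points towards a run-wide problem; if only a few tiles stand out, a local artefact such as a bubble is more plausible. FastQC's "Per tile sequence quality" module gives a similar view when the read names carry tile information.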
I suspect the sequence duplication level after cropping is closer to the truth. Before cropping, the duplication was possibly masked by the low-quality bases: sequencing errors at those positions make otherwise identical reads look distinct.
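The masking effect can be illustrated with a toy example. This is a sketch under stated assumptions, not FastQC's actual code: per its documentation, FastQC's duplication module counts only exact sequence matches and compares a truncated prefix for long reads, and the 50-base `compare_len` default below is an assumption standing in for that. The reads, `mutate` and `unique_fraction` are invented for illustration.

```python
def mutate(seq, pos, base):
    """Return seq with the base at 0-based position pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]


def unique_fraction(reads, headcrop=0, compare_len=50):
    """Fraction of reads remaining after exact-match deduplication.

    headcrop mimics removing the first N bases of every read;
    compare_len is the (assumed) number of bases compared per read.
    """
    keys = {read[headcrop:headcrop + compare_len] for read in reads}
    return len(keys) / len(reads)


# Four copies of the same fragment, three with a miscall at bases 5-7
# (0-based positions 4-6), plus one unrelated read.
template = "ACGTACGTTGACCATGGCAATCG"
reads = [
    template,
    mutate(template, 4, "N"),
    mutate(template, 5, "T"),
    mutate(template, 6, "A"),
    "TTTTGGGGCCCCAAAATTTTGGG",
]

before = unique_fraction(reads)             # every read looks unique
after = unique_fraction(reads, headcrop=7)  # the four copies collapse to one
print(f"remaining before crop: {before:.0%}, after crop: {after:.0%}")
```

Before cropping, each miscall makes its read a distinct sequence, so nothing is flagged as a duplicate; once the error-prone bases are removed, the four copies become identical and the remaining fraction drops, which is the same direction as the 71.7% to 32.8% change reported above.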