I am new to quality control, so I am having trouble interpreting the FastQC results on my CUT&RUN dataset.
These are two "Per Base Sequence Content" plots from two libraries of the same sample.
The plot above doesn't seem to show any unexpected bias: as I understand it, the bias at the start of the reads should be related to the tagmentation method or the Illumina adapter, and should be removed during alignment.
The plot below shows an increase in G% along the reads. Both libraries are "failing" according to FastQC, but I read that most biases should not prevent you from starting downstream analysis.
So I wanted to know if anyone can help me understand what is happening in the second plot, so that I can decide whether to keep or remove reads from this library for downstream analysis.
Many thanks!
This is only speculation, but assuming this was done with two-color chemistry, it is possible that a fraction of your library has short inserts. With those, once the sequencer has read through the adapter at the other end, it may simply be generating "G = no signal = no call". You can trim your data for stretches of poly-G.
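To make the trimming idea concrete, here is a minimal Python sketch of removing a poly-G tail from a single read. The function name `trim_poly_g_tail` and the `min_run=10` threshold are illustrative assumptions, not values from this thread, and the sketch only handles an uninterrupted run of G's at the 3' end. Dedicated trimmers are more tolerant (mismatches, quality-aware trimming) and would be the better choice on real data.

```python
def trim_poly_g_tail(seq, qual, min_run=10):
    """Remove a trailing run of G's if it is at least `min_run` bases long.

    `min_run` is an assumed threshold; shorter G runs are left alone
    because they can easily be genuine sequence.
    """
    stripped = seq.rstrip("G")
    run_length = len(seq) - len(stripped)
    if run_length < min_run:
        # Tail too short to be confident it is a no-signal artifact.
        return seq, qual
    # Keep the quality string the same length as the trimmed sequence.
    return stripped, qual[:len(stripped)]


if __name__ == "__main__":
    read = "ACGTTCA" + "G" * 20
    print(trim_poly_g_tail(read, "I" * len(read)))
```

A read that consists only of G's comes back empty from this function, so it would still need to be discarded by a minimum-length filter afterwards.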
Thanks a bunch. I actually have poly-G sequences as the top overrepresented sequence, but I did not know what that meant.
GenoMax Actually, I was misled by the track names and my sample_sheet file. The two plots are the forward and reverse reads (R1 & R2) of the same library. It appears that R1 has 2M+ reads containing long 'GGGGGGGGGGGGGGGGGGGGGGGGG' stretches, while R2 has only 200k of these poly-G sequences. Is this a behavior you have ever seen?
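For anyone wanting to reproduce this kind of R1 versus R2 comparison, a rough sketch in Python is below. It counts reads containing an uninterrupted run of G's in a standard 4-line FASTQ. The helper names, the file names in the usage example, and the default run length of 25 are my own assumptions for illustration.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_poly_g_reads(lines, min_run=25):
    """Return (reads_with_poly_g, total_reads) for 4-line FASTQ records.

    `lines` is any iterable of FASTQ lines; `min_run` (assumed default)
    is the length of the uninterrupted G run to look for.
    """
    run = "G" * min_run
    total = hits = 0
    # In 4-line FASTQ the sequence is the 2nd line of each record.
    for seq in islice(lines, 1, None, 4):
        total += 1
        if run in seq:
            hits += 1
    return hits, total


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names): python count_polyg.py R1.fastq.gz R2.fastq.gz
    for path in sys.argv[1:]:
        with open_fastq(path) as handle:
            hits, total = count_poly_g_reads(handle)
        print(f"{path}: {hits} of {total} reads contain a poly-G run")
```

This assumes no wrapped sequence lines, which holds for typical Illumina output.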
Pure poly-G reads are not usable. You should remove them or they will get dropped in subsequent analysis.
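If you prefer to remove them explicitly, the sketch below shows one way to do it in Python while keeping R1 and R2 in sync, which matters for paired-end alignment. It is an illustration under assumptions: reads are represented as `(name, sequence, quality)` tuples, and a read counts as "pure poly-G" when at least 90% of its bases are G. Both the representation and the 0.9 cutoff are hypothetical choices, not something stated in this thread.

```python
def g_fraction(seq):
    """Fraction of bases that are G; an empty read is treated as all-G."""
    return seq.count("G") / len(seq) if seq else 1.0


def drop_poly_g_pairs(pairs, max_g_fraction=0.9):
    """Yield only the read pairs in which neither mate is (almost) all G.

    `pairs` is an iterable of (r1, r2), each read a (name, seq, qual) tuple.
    The whole pair is dropped if either mate fails, so that R1 and R2
    stay synchronized. `max_g_fraction` is an assumed cutoff.
    """
    for r1, r2 in pairs:
        if g_fraction(r1[1]) >= max_g_fraction:
            continue
        if g_fraction(r2[1]) >= max_g_fraction:
            continue
        yield r1, r2


if __name__ == "__main__":
    pairs = [
        (("a", "ACGTACGT", "IIIIIIII"), ("a", "TTGACCAT", "IIIIIIII")),
        (("b", "GGGGGGGG", "IIIIIIII"), ("b", "ACGTACGT", "IIIIIIII")),
    ]
    for r1, r2 in drop_poly_g_pairs(pairs):
        print(r1[0], r1[1], r2[1])
```

Dropping the whole pair is the conservative choice here; keeping the surviving mate as a single-end read is also possible but complicates downstream handling.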