I stumbled across something I never saw before with raw FASTQ files during QC. We are working with ChIP-Seq samples (Mutant (MAR) vs WT, 3 replicates each, 1 IP and 1 Input (IN) for each replicates for a total of 12 samples).
A pattern is emerging while looking at mean quality scores per position in the raw reads : First 35 bp are somewhat low-quality, then there is a sudden increase in quality until the end of the read (75 bp), see this plot :
Here is what I know about the samples :
Samples were prepared using "TruSeq® ChIP Library Preparation Kit" (with 1% PhiX Control) and sequenced with Illumina NextSeq (single end 75 bp).
During samples preparation, after several steps, a point was reached with ~ 10M cells in 350uL. Then, 50uL were used for Input (~ 1.4M cells) & 2x150uL (~ 2x4.3M cells) for IP (one IP directed against the protein of interest and one non-specific IP for background noise).
Any suggestions for processing those reads ? Should I trim the first 35 bp regardless of the sample ?
Edit
I allow myself to edit the OP post to take into account all the good suggestions made in the replies. Especially regarding @Devon post to take depth bias into account.
Another thing to know about these samples is the very uneven number of raw reads between them (thus the importance to account for depth bias, which was not done previously) :
The 5 samples which seem to have the best raw_mean_qual_scores_plot are WT_IP_3
, WT_IP_2
, MAR_IN_2
, WT_IN_3
and MAR_IN_3
. They also are the samples with the highest number of raw reads.
These 5 samples happen to be the one considered as potential outliers on PCA plots generated from read count coming from un-normalized BAMs.
New PCA
Below are the relevant options used in cmds leading to a PCA plot on read counts from normalized BIGWIGs (mm10 normalizeTo1x) using deepTools 3.0.1 (using raw reads, no trimming nor duplicates removal) :
bamCoverage --binSize 10 \
--effectiveGenomeSize 2150570000 \
--normalizeUsing RPGC \
--ignoreForNormalization chrX;
multiBigwigSummary bins \
--chromosomesToSkip chrX;
Nb : I tried multiBigwigSummary with --binSize 10 but it was taking way too long
plotPCA --transpose;
Not off the bat. Have you tried to align the data as is? My guess is that should be good align-able data. One reason Q scores suffer is when basecaller perceives low-diversity sequence appearing in many clusters. In ChIPseq one would expect to see a specific sequence getting enriched but to have that as the most predominant sequence in all samples in first 35 cycles (to the extent that it caused the Q scores to drop) seems a bit far fetched.
Yes data was mapped as is vs reference genome : multiqc.
Thing is half my samples do not reach 50% uniquely mapped reads.
There may be an experimental explanation for what you are observing. Have all samples been QC'ed/treated the same way?
Yes all samples have been treated the same way on our side.
I asked if samples were on differents Illumina runs to those concerned, answer is no.