I have some FASTQ files from a ChIP-seq run that I'm trying to analyze with FastQC, but I keep getting this notification partway through the analysis:
$ fastqc -nogroup FB_BcatIP_R1_001_trim.fastq
Started analysis of FB_BcatIP_R1_001_trim.fastq
Approx 5% complete for FB_BcatIP_R1_001_trim.fastq
Approx 10% complete for FB_BcatIP_R1_001_trim.fastq
Approx 15% complete for FB_BcatIP_R1_001_trim.fastq
Approx 20% complete for FB_BcatIP_R1_001_trim.fastq
Approx 25% complete for FB_BcatIP_R1_001_trim.fastq
Approx 30% complete for FB_BcatIP_R1_001_trim.fastq
Approx 35% complete for FB_BcatIP_R1_001_trim.fastq
Approx 40% complete for FB_BcatIP_R1_001_trim.fastq
Too many tiles (>500) so giving up trying to do per-tile qualities since we're probably parsing the file wrongly
Approx 45% complete for FB_BcatIP_R1_001_trim.fastq
Approx 50% complete for FB_BcatIP_R1_001_trim.fastq
Approx 55% complete for FB_BcatIP_R1_001_trim.fastq
Approx 60% complete for FB_BcatIP_R1_001_trim.fastq
Approx 65% complete for FB_BcatIP_R1_001_trim.fastq
Approx 70% complete for FB_BcatIP_R1_001_trim.fastq
Approx 75% complete for FB_BcatIP_R1_001_trim.fastq
Approx 80% complete for FB_BcatIP_R1_001_trim.fastq
Approx 85% complete for FB_BcatIP_R1_001_trim.fastq
Approx 90% complete for FB_BcatIP_R1_001_trim.fastq
Approx 95% complete for FB_BcatIP_R1_001_trim.fastq
Analysis complete for FB_BcatIP_R1_001_trim.fastq
and I can't figure out how to fix the problem or if this will affect my resulting fastqc run. I tried updating to the most recent version of fastqc but haven't been able to figure out any other troubleshooting steps.
I uploaded the html report from fastqc here: http://s000.tinyupload.com/index.php?file_id=43216373901698165438
Additionally the report itself looks pretty unusual I think. The per base quality goes down pretty dramatically at around the 30th position. It looks somewhat similar to the fastqc fail report posted here: fastqc fail
Could anyone tell me why I am receiving this "too many tiles" notification and if I am missing something about running fastqc that could account for the unusual per base sequence quality (or if this just appears to be a bad sequencing run)?
Are you running absolutely the latest version of FastQC? If this is NovaSeq data you would need the latest version to account for the additional tiles on an S4 flowcell.
Thanks, yes I did just update to the most recent version of FastQC (0.11.7) to try to fix the error.
This seems to be 2-color chemistry data (NextSeq, NovaSeq) and does not look normal. Even though the data has not been trimmed it appears to have Q scores cut off at Q12. The per base sequence content plot looks odd and could be indicative of many clusters showing no signal leading to "G" calls.
Have you talked with the facility that generated this data to see if there was any problem during the run? Are there other samples in this pool and do they all look like this?
It is from a NovaSeq. Could you please tell me how you knew that it was 2-color chemistry data?
The facility that did the sequencing did have a fluidics issue that was resolved and I wasn't informed about any issues with the quality of the run from this data. The 5 other samples in this pool all appear similar. Could this be a sequencing issue? I'm following up with them now regarding this issue.
If there was a fluidics issue during the run then this is definitely a sequencing issue. Generally Illumina replaces kits for free in such cases (as long as there is a maintenance contract in place) so the facility should be able to re-run your samples at no charge. I am surprised they even released this data in the first place.
Rising %G calls indicates that this could be a 2-color run. No signal = G calls with that chemistry.
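If you want to check the rising %G yourself, here is a minimal Python sketch that tallies the fraction of G calls at each read position. The function name and the toy reads are made up for illustration; on real data you would feed it the sequence lines from your FASTQ.

```python
from collections import Counter

def g_fraction_by_position(seqs):
    """Return the fraction of G calls at each read position."""
    counts = []  # counts[i] holds the base tallies for position i
    for seq in seqs:
        for i, base in enumerate(seq):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return [c["G"] / sum(c.values()) for c in counts]

# Toy reads (hypothetical) where the 3' end collapses into G calls.
reads = ["ACGTGG", "TTGAGG", "CAGCGG", "GTACG"]
print(g_fraction_by_position(reads))
```

A G fraction that climbs toward 1.0 at later positions is the pattern described above for a 2-color run that is losing signal.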
Hi there,
I'm sure you're concerned that there's an error at all, since your invocation of the command was valid and you're hoping for the best from your dataset. But you haven't actually told us what you want to do. We're guessing that you're just doing routine QC with FastQC and not investigating per-tile quality per se. Most users aren't typically concerned with instrument/chip-level quality failures or biases. Sure, they're useful, but they mostly show you that your chip wasn't smudged and didn't have artifacts.
I guess I'd say how is your per-base quality?? (I can't look at your file from where I am) If your quality is otherwise okay... then you can probably safely ignore concerns about tile-specific artifacts and chip concerns.
On the other hand, you might have quality issues like you've mentioned regarding the 30th position. Is your immunoprecipitation done in a way that removing some trailing bases from some of the reads (and maybe filtering out low-average-quality reads) can still give you relevant sequencing data? I wouldn't worry so much about this warning from fastqc and focus instead on what you want to learn about your samples from the histograms and distributions...
Thanks, sorry I wasn't more clear. I'm just trying to determine if I was doing something obviously wrong with fastqc or if this notification indicates that something is wrong with the file itself?
I'm concerned that my per-base quality looks quite similar to the fastqc fail per-base quality (where there is a technical failure of the sequencer itself). I'm not very familiar with reading these reports so I was hoping I could get someone to comment on how the per base sequence quality looked in light of this fastqc fail image. I was thinking that I would do as you suggest and trim the trailing bases after the 30th position.
When the sequencer takes a high-resolution image of the microchip during each cycle, it records fluorescence intensities across the different channels and therefore each residue/fluorophore. It is physically/chemically possible for background or non-specific fluorescence to be present on a site that hasn't even been decoupled yet (deprotected, etc. whatever word you want to use for the removal of the extension inhibitor)! When there is signal from multiple channels, it decreases the probability that the base was called correctly, and of course a completely ambiguous base would have 25% signal on all channels.
More common, however, is a little bit of bleed-through or noise in the other channels, such that the signal is non-zero outside the primary channel but it is still fairly obvious that the base can be called as 'X' (whatever X may be). The base caller nevertheless includes an empirical probability of miscall, based on the background signal during image processing. That is the phred score. So from an instrumental perspective, it doesn't matter if it's chip-smudge, residue mutation or ambiguous pairing, or chemical contamination. They just model the probability of miscall and include that in the phred scores for each base for each read.
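To make the phred score concrete: it is defined as Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. Below is a small Python sketch of the conversion in both directions, plus decoding a FASTQ quality string. The function names are my own, and the offset of 33 assumes the Phred+33 encoding used by current Illumina output.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base with phred score q is miscalled."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to miscall probability p."""
    return -10 * math.log10(p)

def decode_quality_line(qual, offset=33):
    """Turn a FASTQ quality string into a list of phred scores.

    offset=33 is an assumption (Phred+33 encoding)."""
    return [ord(c) - offset for c in qual]

for q in (12, 20, 30):
    print(q, round(phred_to_error_prob(q), 4))
print(decode_quality_line("I5+"))
```

So Q20 is a 1 in 100 chance of a miscall and Q30 is 1 in 1000, while the Q12 floor mentioned earlier in the thread works out to roughly 6 in 100.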
So! Back to your question, which is roughly: "My histograms look a little off", i.e. the averages are low past the 30th position, or the histogram distributions look a little too wide and permissive of error, etc.
"Is there something wrong with my file?" No. The format is valid... the data is not optimal. But your concerned with the data, not the performance of the Fastqc program.
"So should I care about the warning?" Yes, it means that you might not be able to properly rule out tile quality without looking at another program. If the chip was bad or there are obvious tile-specific artifacts instead of uniformly dispersed low-quality bases, then that might be why your quality histogram isn't right. Perhaps you can sub-sample your fastq file? Or look at another program?
"What if there were tile issues?" You'd just filter out those reads from your fastq file. Filter reads from a specific tile, either with awk or grep, perl or python. Or any sequence toolkit that permits tile-specific filtering. Or you could convert to unmapped-SAM with bbmap's reformat.sh and then grep -v for the tile!
"What do I do when I see a bad histogram?" Well, it's complicated and the right treatment depends on a) how much data you have b) how permissive your pipelines/programs are with trimmed reads and c) your own preference/tolerance for base miscall probabilities. I mean... phred of 20 is still very accurate, all things considered. Learn what phred scores actually mean before following advice blindly.
So, essentially if you want to filter your dataset a bit to improve your quality histogram, you'll need to decide "what percentage of the read could be total garbage for me to completely throw out? 20%, 30%?" (or you can calculate the analogous average quality score for the read if your filtering tool uses that instead of a percent). Alternatively (recommended) you should use a sliding-window quality trimmer to remove bases with low phred scores, which creates heterogeneity in your read length (so you'll then want to use a minimum read length setting). You should see big improvements in the histogram after this (and hopefully retain > 80% of your reads and >90% of bases).
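To show what sliding-window trimming does, here is a simplified Python sketch: scan from the 5' end, cut the read at the first window whose mean quality falls below a threshold, then discard reads that end up too short. This is an illustration only and not the exact algorithm of any particular tool; the function names and the default window, threshold and minimum length values are arbitrary choices of mine.

```python
def sliding_window_trim(seq, quals, window=4, min_avg=15):
    """Cut the read at the first window whose mean quality is below min_avg.

    Reads shorter than the window are returned unchanged."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return seq[:start], quals[:start]
    return seq, quals

def keep_read(seq, min_len=36):
    """Minimum-length filter applied after trimming."""
    return len(seq) >= min_len

# Toy read whose quality collapses toward the 3' end.
seq = "ACGTACGTAC"
quals = [38, 38, 37, 36, 30, 20, 8, 6, 4, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, trimmed_quals, keep_read(trimmed_seq, min_len=5))
```

Using a window average rather than a per-base cutoff means a single poor base in an otherwise good stretch does not truncate the read.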
I would recommend Trimmomatic, which has settings for adapter removal, sliding-window quality trimming, and read filtering.
I'm wondering if the naming of the reads is off somehow, and the software is not parsing the read names properly. Can you post the first 10 lines of your fastq?
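In the meantime, a quick way to check this yourself is to summarize the headers: how many colon-separated fields they have and how many distinct tile values appear. Here is a rough Python sketch, assuming standard Illumina headers with the tile in the 5th field and strict 4-line records. The function name and the example headers are made up.

```python
from collections import Counter

def summarize_headers(fastq_lines):
    """Count header field layouts and collect distinct tile IDs.

    Assumes 4-line FASTQ records and the tile in the 5th
    colon-separated field of the read name."""
    field_counts = Counter()
    tiles = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue  # only header lines carry the read name
        fields = line.strip().split(" ")[0].split(":")
        field_counts[len(fields)] += 1
        if len(fields) >= 7:
            tiles.add(fields[4])
    return field_counts, tiles

# Two made-up records standing in for a real file.
records = [
    "@A00001:1:HXXXXXXXX:1:1101:1000:2000 1:N:0:1", "ACGT", "+", "IIII",
    "@A00001:1:HXXXXXXXX:1:2205:1500:3000 1:N:0:1", "TTGA", "+", "IIII",
]
field_counts, tiles = summarize_headers(records)
print(dict(field_counts), len(tiles))
```

On the real data you would pass an open file handle instead of the list. If every header has 7 fields and the tile values look like ordinary 4-digit tile numbers, the names are probably being parsed fine and the flowcell simply has more than 500 tiles, which fits the NovaSeq explanation given earlier in the thread.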