Question

How to repair *all* problems identified by FastQC?

0

7.0 years ago

dec986 ▴ 380

Hello,

I am downloading public data, and am running FastQC on a number of FASTQ files I've downloaded. I get reports like this:

PASS    Basic Statistics    SRR2637682_1.fastq.bz2
PASS    Per base sequence quality   SRR2637682_1.fastq.bz2
PASS    Per tile sequence quality   SRR2637682_1.fastq.bz2
PASS    Per sequence quality scores SRR2637682_1.fastq.bz2
FAIL    Per base sequence content   SRR2637682_1.fastq.bz2
FAIL    Per sequence GC content SRR2637682_1.fastq.bz2
PASS    Per base N content  SRR2637682_1.fastq.bz2
PASS    Sequence Length Distribution    SRR2637682_1.fastq.bz2
FAIL    Sequence Duplication Levels SRR2637682_1.fastq.bz2
WARN    Overrepresented sequences   SRR2637682_1.fastq.bz2
PASS    Adapter Content SRR2637682_1.fastq.bz2
FAIL    Kmer Content    SRR2637682_1.fastq.bz2

I've read about lots of quality control tools that can fix some of these problems. However, I cannot find one that works properly and generates a "PASS" for all of these.

For example, I have absolutely no idea how to fix the "Kmer Content" module. All I know is that it has shown a FAIL in every real example I've seen.

All I can find are trimmers and adapter removers, which don't improve most of the modules here. For example, I have no idea how to fix "Per base sequence content". All I know is that it always shows FAIL.

FastQC doesn't actually fix anything, so how can I go about fixing all of these modules? Are there some that are okay to fail?

RNA-Seq FastQC • 7.3k views

updated 7.0 years ago by Ian 6.1k • written 7.0 years ago by dec986 ▴ 380

3

Some "problems" are not problems. For example:

FastQC will flag failures for most RNA-seq libraries, because its pass/fail thresholds assume a genomic library.
Illumina TruSeq RNA-seq libraries will always fail "Per base sequence content", because of the hexamer priming bias at the start of the reads.

You have to take FastQC warnings and fails with a grain of salt, taking into account the nature of the samples being analysed.

P.S.: added link for post discussing TruSeq hexamer priming problem.

7.0 years ago by h.mon 35k

1

Nextera genomic libraries also fail the "per base sequence content", at least they did a few years back.

I believe that was because of some residual transposase bias in the first 10-15 bp.
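To see what this module is reacting to, here is a minimal Python sketch that computes per-position base percentages and lists the positions where A vs T or G vs C diverge strongly. The function names and toy reads are made up for illustration, and the 20-point threshold follows my understanding of FastQC's documented FAIL criterion, so treat it as an assumption rather than FastQC's actual code.

```python
from collections import Counter


def per_base_content(reads):
    """Return one {base: percent} dict per read position (0-based)."""
    counts = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    content = []
    for c in counts:
        total = sum(c.values())
        content.append({b: 100.0 * c[b] / total for b in "ACGT"})
    return content


def flag_positions(content, threshold=20.0):
    """Positions where |A-T| or |G-C| exceeds `threshold` percentage points.

    The default of 20 is an assumption based on the FastQC documentation.
    """
    return [
        i for i, p in enumerate(content)
        if abs(p["A"] - p["T"]) > threshold or abs(p["G"] - p["C"]) > threshold
    ]


# Toy reads: the first three positions are biased, the rest are balanced.
reads = ["GATCAGT", "GATTGCA", "GACGTAC", "GTTACTG"]
content = per_base_content(reads)
biased = flag_positions(content)
print(biased)
```

With priming or transposase bias, the flagged positions cluster at the start of the read, which is why trimming by quality does not change the result.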

7.0 years ago by Cliff Beall ▴ 480

1

There are a lot of posts on Biostars about FastQC. For example:

Questions regarding proprocess for raw data and usage of FastQC

What's wrong with this sample? (kmers found by FastQC of RNA-Seq)

Understanding Fastqc Output- Please Help

GC content and Kmer

etc

7.0 years ago by natasha.sernova ★ 4.0k

score 7 · Answer 1 · 2017-12-10

Easy: You download the tool FixReadsForGood.pl and select option --no-more-worries.

Just kidding!

Yes, there are usually some warnings that you can ignore. Quality control is entirely based on your knowledge of the sequences and your purposes. In my opinion, people more often than not unnecessarily filter/trim and lose information.
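As a rough illustration of that kind of triage, the sketch below reads FastQC's tab-separated summary lines (status, module, file name) and splits the non-PASS modules into "commonly expected for RNA-seq" and "worth a closer look". The `EXPECTED_FOR_RNASEQ` set and the `triage` function are hypothetical: the set reflects what is said in this thread plus common experience, not an official list, and you should adjust it to your own library type.

```python
# Assumption: modules that commonly fail on RNA-seq data without
# indicating a real problem. This list is illustrative, not official.
EXPECTED_FOR_RNASEQ = {
    "Per base sequence content",
    "Sequence Duplication Levels",
    "Kmer Content",
}


def triage(summary_text, expected=EXPECTED_FOR_RNASEQ):
    """Split non-PASS modules into (expected, investigate) lists."""
    expected_hits, investigate = [], []
    for line in summary_text.strip().splitlines():
        parts = [p.strip() for p in line.split("\t")]
        if len(parts) != 3:
            continue  # skip anything that is not status/module/file
        status, module, _filename = parts
        if status == "PASS":
            continue
        if module in expected:
            expected_hits.append((status, module))
        else:
            investigate.append((status, module))
    return expected_hits, investigate


summary = "\n".join([
    "PASS\tPer base sequence quality\tSRR2637682_1.fastq.bz2",
    "FAIL\tPer base sequence content\tSRR2637682_1.fastq.bz2",
    "FAIL\tPer sequence GC content\tSRR2637682_1.fastq.bz2",
    "FAIL\tSequence Duplication Levels\tSRR2637682_1.fastq.bz2",
    "WARN\tOverrepresented sequences\tSRR2637682_1.fastq.bz2",
    "FAIL\tKmer Content\tSRR2637682_1.fastq.bz2",
])

expected_hits, investigate = triage(summary)
for status, module in investigate:
    print(f"look into: {status} {module}")
```

For the report in the question, this would leave only "Per sequence GC content" and "Overrepresented sequences" to look at by hand.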

score 2 · Answer 2 · 2017-12-12

A good way to address the errors (taking into account what the others said about their relevance) is to run the reads through a trimming tool such as Trimmomatic or cutadapt. This removes adapters as well as poor-quality reads and bases. Rerunning FastQC afterwards will often show a vast improvement.
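A minimal sketch of what such a trimming step could look like with cutadapt, written as a Python helper that only builds the command line. The file names, the helper name `cutadapt_command`, and the cutoffs are placeholders; the adapter shown is the commonly cited Illumina adapter prefix and is an assumption, so check which adapter your library actually used.

```python
import shlex


def cutadapt_command(in_fastq, out_fastq, adapter, min_quality=20, min_length=20):
    """Build a single-end cutadapt command as an argument list.

    -a: 3' adapter to remove
    -q: trim low-quality bases from the 3' end
    -m: discard reads shorter than this after trimming
    -o: output file
    """
    return [
        "cutadapt",
        "-a", adapter,
        "-q", str(min_quality),
        "-m", str(min_length),
        "-o", out_fastq,
        in_fastq,
    ]


# Placeholder file names and an assumed adapter sequence.
cmd = cutadapt_command("reads.fastq.gz", "trimmed.fastq.gz", "AGATCGGAAGAGC")
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` with cutadapt installed. Expect this to help mainly with the quality and adapter modules; composition-related modules such as "Per base sequence content" usually stay as they were.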

Also, have a read of the excellent QCfail.