Kmer Content in FastQC failed
9.0 years ago
hurtc.stri ▴ 20

Hi All,

I am completely new to NGS analysis. I just received data from a paired-end 125 bp Illumina run. I ran both data files through FastQC to check the data quality and received a few warnings/failures. Both runs failed the Kmer Content test. In the graph, all of the lines hover around zero until you get to the right side of the graph (near positions 94-96), where they increase exponentially to about 9. Does this mean that there could be adaptors left over on the 5' end? I'm not sure exactly how to interpret this or what I should do about it. Thank you in advance for any advice/suggestions.

next-gen-sequencing • 26k views
9.0 years ago
Amitm ★ 2.3k

hi,

Do not worry too much about the k-mer plot. Check the following instead:

1) The per base sequence quality plot is OK if most of the boxes are in the green zone, with maybe a couple towards the end falling below it.

2) The per base sequence content plot should have the 4 lines overlapping each other. Sometimes the lines are noisy for the first ~10 bases or so, but from there on they should smooth out.

3) The per sequence GC content plot should have a single bell-shaped hump (more or less), not two or more. Small shoulders are OK.

4) If you see adapters in the Adapter Content plot, do adapter removal (and maybe also trim the read ends to get rid of low-quality bases), then rerun FastQC.

If the 3 plots above look as described and adapters have been removed, you are good to go.

I do not know exactly why k-mers show enrichment, but if it is towards the end of the read then it could be due to low-quality bases or adapters.
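To make the idea of "k-mer enrichment at a position" more concrete, here is a minimal Python sketch. It is not FastQC's actual algorithm (FastQC applies a statistical test on a subsample of reads); it just compares how often a k-mer occurs at one position with how often you would expect it there if it were spread evenly along the read. The function name and the use of k=7 are my own choices for illustration.

```python
from collections import Counter, defaultdict

def kmer_position_enrichment(reads, k=7):
    """Return {kmer: (position, observed/expected)} for the most enriched position.

    Toy illustration only: 'expected' assumes the k-mer is spread
    evenly over all positions, which is a simplification.
    """
    per_pos = defaultdict(Counter)  # kmer -> Counter(position -> count)
    windows_at = Counter()          # position -> number of k-mer windows seen there
    for read in reads:
        for i in range(len(read) - k + 1):
            per_pos[read[i:i + k]][i] += 1
            windows_at[i] += 1

    total_windows = sum(windows_at.values())
    result = {}
    for kmer, counts in per_pos.items():
        overall_rate = sum(counts.values()) / total_windows

        def ratio(pos):
            # observed count at pos divided by the count expected at pos
            return counts[pos] / (overall_rate * windows_at[pos])

        best = max(counts, key=ratio)
        result[kmer] = (best, ratio(best))
    return result
```

A k-mer that comes from adapter read-through sits at the same place in many reads, so its ratio is high at that one position and the k-mer is absent elsewhere, which is the kind of late spike described in the question.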

Caveat: the above generalizations are for WES/WGS/RNA-seq data. If you have low-volume data, such as from amplicon sequencing, FastQC will show a lot of noise. For example, the GC plot might have multiple shoulders, but this would be due to the low diversity in your data (a handful of genes).


Thank you so much for the tips. The per base sequence content plot did issue a warning. The A/T lines overlap at around 30% and the G/C lines overlap at around 20%. I assume this means that my genome is A/T biased?


Yes, from what you are saying it seems so. You can get a better idea by looking at the 'Basic Statistics' section, which also gives you the %GC.
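As a quick sanity check on those numbers: A and T at about 30% each and G and C at about 20% each works out to roughly 40% GC. If you want to compute it yourself from the read sequences, a minimal Python sketch could look like this (the function name is hypothetical, and N or other ambiguous bases are simply ignored, which may differ slightly from how FastQC counts):

```python
def gc_percent(reads):
    """Percentage of G+C among the A/C/G/T bases in an iterable of read sequences."""
    gc = 0
    total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0
```

For a read with the composition described above, e.g. `gc_percent(["AAATTTGGCC"])`, the result is 40.0.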

8.2 years ago

I have the same problem, also with 125bp paired end Illumina sequencing. The issue appears to be a shorter than desired fragment size distribution leading to adapter read-through on ~10k reads.
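A rough way to check for read-through yourself is to look for the start of the adapter inside the reads and see where it begins. When the fragment is shorter than the read length, the adapter starts right where the fragment ends. The sketch below assumes the common Illumina adapter prefix AGATCGGAAGAGC (your kit may use a different one, so check with your sequencing provider), only finds exact matches, and uses a function name I made up:

```python
from collections import Counter

# Assumption: prefix of the common Illumina TruSeq-style adapter.
ADAPTER_PREFIX = "AGATCGGAAGAGC"

def adapter_start_positions(reads, adapter=ADAPTER_PREFIX):
    """Count the 0-based position of the first exact adapter match in each read."""
    positions = Counter()
    for read in reads:
        idx = read.find(adapter)
        if idx != -1:  # -1 means the adapter was not found in this read
            positions[idx] += 1
    return positions
```

The total of the counts tells you how many reads are affected, and the positions correspond to the fragment lengths of those reads.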


It happens. Use an appropriate trimming program and go on to the next step.
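Just to show what trimming does conceptually, here is a toy Python sketch: cut the read (and its quality string) at the first exact adapter match and drop reads that end up too short. The adapter sequence and the minimum length of 36 are assumptions for illustration. Real trimming programs also handle mismatches, partial adapters at the very end of the read, and keeping paired reads in sync, so use one of those for actual data.

```python
def trim_adapter(seq, qual, adapter="AGATCGGAAGAGC", min_len=36):
    """Trim seq/qual at the first exact adapter match.

    Returns (seq, qual), or None if the trimmed read is shorter than min_len.
    Toy example: exact matching only, single reads only.
    """
    idx = seq.find(adapter)
    if idx != -1:
        seq, qual = seq[:idx], qual[:idx]
    if len(seq) < min_len:
        return None
    return seq, qual
```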

9.0 years ago
dally ▴ 210

This has some good examples: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Then you can look into the following tools to trim your data: Trimmomatic, FASTX-Toolkit.

Good luck!


Great tutorial - thanks for the link. Now onto Trimmomatic!
