Question

Should I remove Kmers identified in FastQC?

8

9.6 years ago

kezcleal ▴ 160

Hi, apologies for this basic question; I am new to the field. I have been checking my NGS data with FastQC, and the Kmer content check fails. The sequence TAGATCGGAA appears at positions 90-100 bp in the reads and is enriched about 12-fold (Obs/Exp Max). However, nothing shows up in the 'Overrepresented sequences' or 'Adapter content' sections of the report (completely flat line). The sequencing was done at BGI, and I do not know what primers were used. Should I be trying to remove this Kmer sequence?

If I use grep to look for the sequence in the first million reads of my FASTQ file, I find only 36 matches, which seems quite low.

$ gunzip -c data1.fq.gz | head -4000000 | grep TAGATCGGAA | wc -l
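
A plain grep also searches header and quality lines and says nothing about where in the read the match sits. The following is a minimal sketch, assuming standard 4-line FASTQ records, that counts only sequence lines and tallies the 1-based start position of the first match per read. The function name `kmer_positions` is made up for illustration.

```shell
# Count reads containing a k-mer and tally where the first match starts.
# Reads FASTQ on stdin; assumes 4-line records (no wrapped sequences).
kmer_positions() {
  awk -v kmer="$1" '
    NR % 4 == 2 {                 # line 2 of each record is the sequence
      reads++
      pos = index($0, kmer)       # 1-based start of first match, 0 if absent
      if (pos > 0) {
        hits++
        start[pos]++
        if (pos > maxpos) maxpos = pos
      }
    }
    END {
      printf "reads=%d hits=%d\n", reads, hits
      for (p = 1; p <= maxpos; p++)
        if (p in start) printf "start=%d count=%d\n", p, start[p]
    }'
}

# Usage (file name taken from the question):
# gunzip -c data1.fq.gz | head -4000000 | kmer_positions TAGATCGGAA
```

If the matches pile up near the 3' end of the reads, that would be consistent with adapter read-through rather than genomic sequence.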

Thanks for the help

next-gen genome FastQC • 8.6k views

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.6 years ago by kezcleal ▴ 160

2

This plot in general, and that fold change and p-value in particular, are not informative and in the vast majority of cases cause only confusion.

This plot should be removed from the output of this tool.

ADD REPLY • link 8.3 years ago by Istvan Albert 102k

0

Agreed, I really wish that FastQC had big warnings above their plots like "this is only meaningful for whole-genome sequencing".

ADD REPLY • link 8.3 years ago by Devon Ryan 105k

0

I have ChIP-seq data with a similar problem. The quality scores are good, so I aligned the reads with Bowtie and called peaks with MACS2. Only when I use this particular sample (input) do I get no peaks for my IP samples; using a different input yields peaks for all IP samples. I would appreciate any feedback.

ADD REPLY • link updated 24 months ago by Ram 44k • written 8.5 years ago by ATCG ▴ 400

1

Try doing a PCA or clustering of the various input and ChIP samples (e.g., with deepTools); perhaps you have a sample swap.
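
One possible way to do this with deepTools is sketched below. It assumes sorted, indexed BAM files; the function name `sample_qc` and all file names are placeholders, and the default binning options may need adjusting for your data.

```shell
# Sketch: summarise read coverage across BAM files, then draw a PCA and a
# Spearman correlation heatmap with deepTools. Output names are placeholders.
sample_qc() {
  multiBamSummary bins --bamfiles "$@" -o coverage.npz &&
  plotPCA -in coverage.npz -o pca.png &&
  plotCorrelation -in coverage.npz --corMethod spearman \
    --whatToPlot heatmap -o correlation.png
}

# Usage (hypothetical file names):
# sample_qc input1.bam input2.bam chip1.bam chip2.bam
```

Samples that cluster with the wrong group in either plot are candidates for a swap.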

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

ChIPseq sample PCA using deepTools

B = control; inputB is the input for this sample
P = treatment; inputP is the input for this sample
The IP for B1 looks unusual. Could you please provide your feedback?

ADD REPLY • link updated 24 months ago by Ram 44k • written 8.3 years ago by ATCG ▴ 400

0

[Image; no description provided]

ADD REPLY • link updated 24 months ago by Ram 44k • written 8.3 years ago by ATCG ▴ 400

0

I suspect that the outlier sample has signal mostly at regions that should be blacklisted. Either way, you have one weird sample that might need to be excluded; that happens.

ADD REPLY • link 8.3 years ago by Devon Ryan 105k

0

Hi @kezcleal

I see a very similar k-mer plot, with the exact same k-mers showing up as over-represented. Were you able to find the cause? My data are Agilent exomes, so I wonder whether adapters, baits, or control sequences from the target enrichment kit could be popping up.

Thanks!

ADD REPLY • link updated 24 months ago by Ram 44k • written 8.2 years ago by rachanaj • 0

Ram · Accepted Answer · 2015-05-29

7

9.6 years ago

Devon Ryan 105k

No, there's no point in removing that. It's likely just a bit of adapter contamination (I think the adapter content section of FastQC looks for closer to full-length adapters, rather than just 5-8 bases). Go ahead and trim adapters regardless.

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.6 years ago by Devon Ryan 105k

0

Thanks for the response. I also have the problem of not knowing which adapters to trim, as this sequence doesn't seem to correspond to the well-known adapters, e.g. Illumina universal, Nextera, TruSeq etc. Should I run something like Trimmomatic and give it all of these potential adapter sequences anyway?

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.6 years ago by kezcleal ▴ 160

4

If you have paired reads, you can find out what the adapter sequences are with BBMerge:

bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa reads=1m

You can also adapter-trim using BBDuk without knowing the adapter sequence using the tbo (trimbyoverlap) flag, but it's best to use both the tbo flag AND the adapter sequence.
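
A minimal sketch of that combined approach follows. The wrapper name `trim_with_bbduk` and the output file names are placeholders; `adapters.fa` is assumed to be the `outa=` file written by the bbmerge.sh command above. The k-mer settings (`ktrim=r k=23 mink=11 hdist=1`) follow the adapter-trimming example in the BBDuk usage guide, and `tpe` trims both reads of a pair to the same length.

```shell
# Sketch: adapter-trim a read pair with BBDuk, using both a reference
# adapter file (ref=) and overlap-based trimming (tbo).
# $1, $2 = paired FASTQ files in the current directory; $3 = adapter FASTA.
trim_with_bbduk() {
  bbduk.sh in1="$1" in2="$2" \
    out1="trimmed_$1" out2="trimmed_$2" \
    ref="$3" \
    ktrim=r k=23 mink=11 hdist=1 \
    tpe tbo
}

# Usage:
# trim_with_bbduk r1.fq r2.fq adapters.fa
```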

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k

0

Sure, or trim_galore, since its default sequence will probably work.
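
A quick way to see why the default is plausible here: Trim Galore's default Illumina adapter is AGATCGGAAGAGC, and the k-mer from the question, minus its first base, is a prefix of it. The check below is only an illustration in plain shell; treating the leading T as an insert base is an assumption.

```shell
# Assumption: AGATCGGAAGAGC is the Illumina adapter Trim Galore uses by default.
adapter=AGATCGGAAGAGC
kmer=TAGATCGGAA          # k-mer reported by FastQC in the question

# Drop the first base (presumably the last base of the insert), then test
# whether the remainder is a prefix of the adapter.
rest=${kmer#?}
case "$adapter" in
  "$rest"*) result="prefix of adapter" ;;
  *)        result="no match" ;;
esac
echo "$rest: $result"
# prints "AGATCGGAA: prefix of adapter"
```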

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.6 years ago by Devon Ryan 105k

0

The percentage of reads with partial adapters is low, ca. 6% in my case. I think it's best to identify and remove those reads (after all, falsely duplicated sequences could somehow mess up e.g. ChIP-seq or DNase-seq results, am I right?), but Trimmomatic doesn't allow that.

ADD REPLY • link 8.2 years ago by boczniak767 ▴ 870

0

Don't bother deduplicating before alignment. It's faster and easier post-alignment.

ADD REPLY • link 8.2 years ago by Devon Ryan 105k

0

I mean, the k-mers could potentially influence mapping to the genome, with reads carrying partial adapter sequence generating 'adapter' peaks. Or is it too short to influence mapping?

ADD REPLY • link 8.2 years ago by boczniak767 ▴ 870

0

If the k-mers aren't part of an adapter (or something else artificially added) then they shouldn't be removed. They won't bias mapping because they should be there.

ADD REPLY • link 8.2 years ago by Devon Ryan 105k

2

9.6 years ago

cyril-cros ▴ 950

FastQC often displays failed checks. That does not mean your data are bad; they are just warnings.

The only thing you should always really check is the quality along the length of your reads, which might force you to do some trimming. In this case, I would say that your reads are ok as they are, and you can proceed to the alignment phase of your workflow.

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.6 years ago by cyril-cros ▴ 950

0

The per-base quality seems to be good, I think (yellow boxes all in the green region). Thanks.

ADD REPLY • link updated 24 months ago by Ram 44k • written 9.6 years ago by kezcleal ▴ 160