GC content and Kmer
9.2 years ago
Xin ▴ 70

Hi all ~~

I'm running quality control on my reads before further RNA-seq analysis.

When I ran FastQC for the first time, the following modules failed:

  • Per base sequence content
  • Per sequence GC content
  • Sequence duplication levels
  • Overrepresented sequences
  • Kmer content

Then I checked the overrepresented sequences with BLAST, and they came from the chloroplast genome. So I removed all chloroplast reads from my data.

Then I ran FastQC again, and every module listed above still failed except Overrepresented sequences, so I still had 4 failures. After reading around, I understood that a failure in Sequence duplication levels is OK because the duplicates may come from highly expressed transcripts. So I stopped worrying about Sequence duplication levels.

The failure in Per base sequence content, on the other hand, was due to the first 13 bases. So I trimmed them, and this module now passes.

But Per sequence GC content and Kmer content still fail. Kmer content is even worse than before: the peaks are now spread across all positions, whereas at the beginning they were concentrated around the first 12 positions (which I trimmed).

To make sure there were no leftover adapter sequences (the Adapter Content module in FastQC is completely fine), I put all the adapter sequences used by Illumina in a file and tried to trim them off, but as I expected there were no adapters and nothing changed in quality.

So do you have any suggestions for getting better-quality reads? Did I do something wrong? Why did Kmer content get worse?

RNA-Seq • 5.7k views
9.2 years ago

For RNAseq datasets one typically sees failures in those modules. I should point out that FastQC is really geared toward whole-genome sequencing. We use it for all types of datasets, but many of them will have failures in various modules that can be ignored. In particular, the modules you mentioned will commonly fail on RNAseq samples and I wouldn't worry about that. Additionally, I would encourage you to not trim off the first 13 bases that are biased. The bases are actually correct and you'll lower your mapping rate (and cause some false alignments) by doing that.

BTW, remember with RNAseq data you expect anything short and highly expressed to be sequenced many many many times. That alone will cause a shift in the GC and kmer profile and cause "failures" in these tests. Don't worry about stuff like that.
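To illustrate the point about abundant short transcripts, here is a minimal Python sketch. It is not FastQC's own calculation, and the reads are made-up toy sequences: it tallies per-read GC percentage and shows how one sequence repeated many times puts a spike in the distribution. The function names `gc_percent` and `gc_histogram` are hypothetical.

```python
from collections import Counter

def gc_percent(read):
    """Return the GC percentage of one read, rounded to an integer."""
    gc = sum(1 for base in read.upper() if base in "GC")
    return round(100 * gc / len(read))

def gc_histogram(reads):
    """Count how many reads fall at each integer GC percentage."""
    return Counter(gc_percent(r) for r in reads)

# Toy data (invented for illustration): a small varied background plus one
# GC-rich sequence repeated many times, standing in for a short,
# highly expressed transcript.
background = ["ACGTACGTAT", "ATATGCATTA", "GCATATATGC", "TTGACAATGC"]
abundant = ["GCGCGGCCGC"] * 20

hist = gc_histogram(background + abundant)
for pct in sorted(hist):
    print(f"{pct:3d}% GC: {hist[pct]} reads")
```

In this toy set the repeated sequence dominates the histogram at 100% GC, which is the kind of shift away from a smooth distribution that can trip the Per sequence GC content module without anything being wrong with the library.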

Hi Devon Ryan, I'm working on RNA-seq data from Plasmodium falciparum. Can I go ahead with mapping, without trimming, given these QC results?

  • [FAIL] Per base sequence content
  • [FAIL] Per sequence GC content
  • [WARN] Sequence Length Distribution
  • [FAIL] Sequence Duplication Levels
  • [FAIL] Overrepresented sequences

9.2 years ago
James Ashmore ★ 3.5k

Have a look at the expected and observed number of Kmers in your FastQC report. If you have say 30 million reads and your top Kmer appears only 2,000 or so times, this doesn't mean there is a problem with your library - for example in a ChIP-seq dataset you may see the binding motif come up as an enriched Kmer which would make sense. For RNA-seq I imagine there is something similar, such as sequencing the same region of the most highly expressed gene. Just a thought, although others may have better recommendations.
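To put those numbers in perspective, a rough Python sketch follows. It only counts raw k-mer occurrences and relates the top count to the number of reads. This is not the observed/expected statistic FastQC reports, and the helper names `count_kmers` and `top_kmer_share` and the toy reads are assumptions for illustration.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (overlapping windows of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def top_kmer_share(reads, k):
    """Return the most common k-mer, its count, and count / number of reads."""
    counts = count_kmers(reads, k)
    kmer, n = counts.most_common(1)[0]
    return kmer, n, n / len(reads)

# The figures from the example above: 2,000 occurrences in 30 million reads.
print(f"{2000 / 30_000_000:.4%}")  # prints 0.0067%

# Toy reads (invented) just to show the helper in use.
kmer, n, share = top_kmer_share(["ACGACG", "TTACGA"], k=3)
print(kmer, n, share)
```

Note that the share can exceed 1 because a k-mer may occur more than once in a single read. With real data, a top k-mer present at a tiny fraction of the read count is consistent with a biological signal, such as a motif or a highly expressed region, rather than a library problem.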
