First post to Biostars... hope I do this right.
Hi All,
I've been fighting a bit with an attempt at a genome assembly.
Based on reading this forum, I suspect there is an issue with the raw data, but I just want to make sure I've understood everything. FastQC looks OK. There are some over-represented kmers, but I don't think that is the problem.
The problem is that I seem to have an excess of rare kmers (which seemingly indicates sequencing errors). Both ALLPATHS and ABYSS seem to be telling the same story (as they should!), but I am not sure why. I have pasted the first few lines of the coverage.hist from ABYSS below.
These data have been quality trimmed using trim_galore. One relevant post I found suggests using QUAKE for correction. Is that still my next step? Or is it possible these data are fundamentally flawed? I guess one question is, what causes an excess of rare kmers if the quality scores of the data are very high?
Thanks!
Chris
1 1182909997
2 84699927
3 9033507
4 5000923
5 3572263
6 3223489
7 3322965
8 3843986
9 4777951
10 6183795
11 8121580
12 10580154
13 13495444
Looks like every other dataset to me. Plot column 1 vs column 2 and look at how many peaks you have. You will need to play with the x/y axis ranges.
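If plotting is giving you trouble, the same check can be done numerically. Here is a minimal Python sketch, assuming the histogram has two whitespace-separated columns (multiplicity, then count) as in the lines posted above. The function names `parse_hist`, `first_valley`, and `main_peak` are made up for illustration and are not part of ABySS or any other tool.

```python
# Sketch: locate the boundary between error k-mers and genomic k-mers
# in a k-mer coverage histogram.
# Assumption: each line holds "multiplicity count", whitespace-separated.

def parse_hist(text):
    """Return a list of (multiplicity, count) pairs from histogram text."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        rows.append((int(fields[0]), int(fields[1])))
    return rows

def first_valley(rows):
    """Multiplicity of the first local minimum (where counts stop falling)."""
    for (m, c), (_, nxt) in zip(rows, rows[1:]):
        if nxt > c:
            return m
    return None  # counts never rise again within this range

def main_peak(rows, valley):
    """Multiplicity with the highest count above the valley."""
    above = [(c, m) for m, c in rows if m > valley]
    return max(above)[1] if above else None

# The first lines of coverage.hist exactly as posted in the question.
# They are truncated, so the real peak lies somewhere beyond 13.
posted = """1 1182909997
2 84699927
3 9033507
4 5000923
5 3572263
6 3223489
7 3322965
8 3843986
9 4777951
10 6183795
11 8121580
12 10580154
13 13495444
"""

rows = parse_hist(posted)
valley = first_valley(rows)
print("first valley at multiplicity", valley)
print("highest count above the valley (in this excerpt) at", main_peak(rows, valley))
```

On the posted excerpt the counts fall until multiplicity 6 and then rise again, which is the usual shape: a spike of rare (mostly erroneous) k-mers, a valley, then the genomic peak(s). Run it on the full file to see where the peaks actually are.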
As Adrian says, kmer frequency histograms (for isolates) usually look like that. I recommend adapter-trimming if you have not already done so, however. You can also check for and remove contaminants, particularly human, which sometimes will contribute to low-frequency kmers.
oops, messed up plotting, one second...
Ok, here is the full plot. Is this really normal? This is from ~1 lane of PE reads from a bird genome. R isn't labelling my X axis at the moment, sorry.
That looks fine; typical diploid pattern. It looks much more sane if you plot it on a log scale.
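To see why the log scale helps, here is a rough text-only Python sketch that draws one bar per multiplicity with length proportional to log10 of the count. The counts are the ones posted in the question; `log_bars` is a made-up helper name, not something from an existing package.

```python
import math

def log_bars(rows, width=50):
    """Return one text bar per row, with length proportional to log10(count)."""
    top = max((math.log10(c) for _, c in rows if c > 0), default=0)
    if top == 0:
        top = 1  # avoid dividing by zero when every count is 0 or 1
    lines = []
    for m, c in rows:
        n = round(width * math.log10(c) / top) if c > 0 else 0
        lines.append(f"{m:>4} {'#' * n} {c}")
    return lines

# Counts from the coverage.hist excerpt posted in the question.
rows = [
    (1, 1182909997), (2, 84699927), (3, 9033507), (4, 5000923),
    (5, 3572263), (6, 3223489), (7, 3322965), (8, 3843986),
    (9, 4777951), (10, 6183795), (11, 8121580), (12, 10580154),
    (13, 13495444),
]

for line in log_bars(rows):
    print(line)
```

On a linear axis the multiplicity-1 count (over a billion) flattens everything else to nothing; on a log axis the valley and the rise towards the genomic peak stay visible. In R the equivalent is `plot(..., log="y")`, and in matplotlib it is `plt.yscale("log")`.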
thanks... now I know... on to other forms of assembly trouble-shooting!
Looks like you have a diploid pattern like Brian said. Adapter trimming and quality trimming can remove the low-frequency k-mers, as can removing contaminants, if any. Try SPAdes or QUAKE for read error correction before assembly (SPAdes does error correction and assembly as part of its pipeline).