In a plant genome project I got a draft assembly (>500 Mbp, >500k contigs). A number of the contigs are no doubt bacterial in origin.
There are at least 3 peaks in GC content (40% - my plant, 50% - the largest contig, 65-70% - another group).
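For reference, a minimal sketch of how per-contig GC content can be binned to look for such peaks. The function names and the 5% bin width are my own choices for illustration, and ambiguous bases (N etc.) are simply ignored:

```python
from collections import Counter

def gc_content(seq):
    """GC fraction of a sequence, ignoring anything that is not A/C/G/T."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def gc_percent_bin(seq, width=5):
    """Lower edge (in percent) of the GC bin a sequence falls into.
    Integer arithmetic avoids floating-point surprises at bin edges."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        return 0
    return (100 * gc // total) // width * width

def gc_histogram(seqs, width=5):
    """Number of sequences per GC bin (width is a hypothetical default)."""
    return Counter(gc_percent_bin(s, width) for s in seqs)
```

Calling `gc_histogram` on all contig sequences gives the counts per bin; weighting by contig length instead of counting contigs may be more informative for a fragmented assembly.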
Blastn takes ages, and there is no point in running it every time we change the assembler parameters even slightly. So while sooner rather than later I will have to split the 454 sff files into my_plant vs not_my_plant, I will still need a faster method of assigning contigs to the not_my_plant group.
In metagenomics this is often done by calculating k-mer frequencies; see e.g. TETRA (no longer supported): http://www.megx.net/tetra/ (the manual describes the algorithm).
Do you use any program for fast clustering/classification of sequences from, say, 150 bp to 1 Mbp using k-mer frequencies?
Thank you, that's a very useful tip.
It is quite fast to count all 4-mers in my draft sequence. I think I have to count them on the reverse strand as well and add those to the forward counts (done). To implement the TETRA measure I will start with RPy.
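Roughly what I mean by counting on both strands, as a plain-Python sketch. All names here are mine. Note that the comparison at the end is just a Pearson correlation of raw 4-mer frequencies, which is a simpler stand-in and not the TETRA measure itself (see the TETRA manual for that):

```python
from itertools import product
from math import sqrt

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # all 256 4-mers
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement; non-ACGT characters are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]

def tetra_counts(seq):
    """Count 4-mers on the forward and reverse strand together.
    Windows containing anything other than A/C/G/T are skipped."""
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:
                counts[kmer] += 1
    return counts

def tetra_freqs(seq):
    """4-mer frequency vector in the fixed order given by KMERS."""
    counts = tetra_counts(seq)
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

With this, `pearson(tetra_freqs(contig), tetra_freqs(reference))` gives a rough similarity between a contig and a reference profile. Because both strands are counted, a contig and its reverse complement get identical profiles. For very short contigs (a few hundred bp) most of the 256 counts will be zero, so I would not trust the numbers much there.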