Question

K-Mer Based Sequencing Contamination Detection

14.8 years ago

Darked89 4.7k

In a plant genome project I got a draft assembly (> 500 Mbp, > 500k contigs). A number of the contigs are no doubt bacterial in origin.

There are at least 3 peaks when it comes to GC content (40% - my plant, 50% largest contig, 65-70% another group).
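As a rough illustration of the GC observation above, here is a minimal Python sketch that computes per-contig GC content and sorts contigs into coarse bins. The function names and the two cut-offs (0.45 and 0.60) are hypothetical, picked only to sit between the peaks mentioned; they would need tuning against the real GC histogram.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); N etc. are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gc_bin(seq, low=0.45, high=0.60):
    """Assign a contig to a coarse GC bin. The cut-offs are placeholder assumptions."""
    gc = gc_fraction(seq)
    if gc < low:
        return "low_gc"
    if gc < high:
        return "mid_gc"
    return "high_gc"


# Toy sequences standing in for real contigs.
contigs = {
    "contig_a": "ATATATGCATATTTAAGCAT",
    "contig_b": "GCGCATGCGGCCGCATGCCG",
}
for name, seq in contigs.items():
    print(name, round(gc_fraction(seq), 2), gc_bin(seq))
```

GC content alone cannot separate a bacterium whose GC is close to the plant's, so this is at best a first-pass filter.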

Blastn takes ages, and there is no point in running it every time we change the assembler parameters even slightly. So while sooner rather than later I will have to split the 454 sff files into my_plant vs not_my_plant, I will still need a faster method of assigning contigs to the not_my_plant group.

In metagenomics this is often done by calculating k-mer frequencies; see e.g. TETRA (no longer supported): http://www.megx.net/tetra/ (the manual describes the algorithm).
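To make the idea of comparing k-mer frequencies concrete, here is a simplified Python sketch that builds a tetranucleotide frequency profile (counted on both strands) and compares two sequences by Pearson correlation. This is not the TETRA algorithm; the function names are hypothetical, and plain relative frequencies are used as an assumption for illustration only.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so profiles are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def revcomp(seq):
    """Reverse complement; non-ACGT symbols are left unchanged."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tetra_profile(seq):
    """Relative tetranucleotide frequencies, counted on both strands."""
    counts = dict.fromkeys(KMERS, 0)
    seq = seq.upper()
    for strand in (seq, revcomp(seq)):
        for i in range(len(strand) - 3):
            kmer = strand[i:i + 4]
            if kmer in counts:  # skips windows containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[k] / total if total else 0.0 for k in KMERS]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


# Toy example: an AT-rich and a GC-rich sequence.
at_rich = "ATATTTAAATATTAATGCAT" * 20
gc_rich = "GCGGCCGCATGCCGGCGCTA" * 20
print(round(pearson(tetra_profile(at_rich), tetra_profile(gc_rich)), 3))
```

Counting both strands makes the profile independent of contig orientation, which matters because assembled contigs come out in arbitrary orientation.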

Do you use any program for fast clustering/classification of sequences from say 150bp to 1Mbp using k-mer frequencies?

sequencing • 9.6k views

updated 13 months ago by Ram 44k • written 14.8 years ago by Darked89 4.7k

Ram · Answer 1 · 2010-03-06

I would recommend R/Bioconductor for this kind of analysis, even though I personally doubt that the method is precise enough to separate the reads correctly. The function oligonucleotideFrequency is in the Biostrings package. The code for the first step would look something like this:

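library(Biostrings)
# readDNAStringSet() replaces read.DNAStringSet(), which was removed from Biostrings
reads <- readDNAStringSet("yourReads.fas", format="fasta")
# tetranucleotide counts for the first 100 reads: one row per read, 256 columns
nf <- oligonucleotideFrequency(reads[1:100], width=4)
# hierarchical clustering on Euclidean distances between the count vectors
hclust(dist(nf))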

That would be a very simple form of clustering. Beyond that, R gives you access to more powerful classification algorithms, for example a support vector machine classifier. Create a training set and a test set of reads from two or more sequenced genomes, mix them, and you will see whether the reads can be separated.

But if you look at the counts, they might look like this:

     TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT
[1,]    0    0    0    0    0    0    0    0    1    0    0    1    2    0    0    4    0    0
[2,]    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    0
[3,]    0    0    0    1    1    1    0    0    0    1    0    0    0    0    0    1    0    0
[4,]    0    0    1    1    0    1    0    2    0    0    1    0    2    1    0    0    1    1
[5,]    0    1    0

So, lots of 0s and 1s, which may not be enough to classify correctly. This example comes from some 454 reads, and it suggests that one should try di- and trinucleotides as well.
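The sparsity follows from simple arithmetic: a read of length L contains L - k + 1 overlapping k-mers, spread over 4^k possible k-mers. A small Python sketch of that calculation, assuming a read length of 250 bp (an illustrative figure, not taken from the data above):

```python
def mean_count_per_kmer(read_length, k):
    """Average count per possible k-mer: (L - k + 1) windows over 4**k k-mers."""
    windows = max(read_length - k + 1, 0)
    return windows / 4 ** k


# Assumed read length of 250 bp; compare di-, tri- and tetranucleotides.
for k in (2, 3, 4):
    print(k, mean_count_per_kmer(250, k))
```

Under that assumption a tetranucleotide is seen slightly less than once per read on average (247/256), while a dinucleotide is seen about 15 times (249/16), which is why shorter k-mers give less noisy profiles for individual reads.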

Alternative: run blastx on the individual reads and discard only those with a good best hit to a bacterium. A few wrong reads should do no great harm, so it may be better not to risk filtering out too many beforehand.

Ram · Answer 2 · 2010-12-10

14.0 years ago

Monzoor ▴ 300

You can try out this site for the problem you have.

The software available at this site helps you in separating eukaryotic sequences from prokaryotic sequences without the need for blast alignment. It can analyze a million sequences in roughly an hour.

updated 5.3 years ago by Ram 44k • written 14.0 years ago by Monzoor ▴ 300

Ram · Answer 3 · 2010-12-10

I'd do it the other way round.

1. Find contaminated contigs: take the contigs of one of your assemblies and screen them against a microbial database. Check the contigs that come out with a significant hit and, once you have validated that a hit is not spurious, put them in a separate file.
2. Find contaminated reads: using the validated contigs from above, simply perform a mapping assembly of your 454 reads against them. Use stringent alignment parameters (perhaps something like 95% identity). The assembler will tell you which reads mapped and which didn't.

Well, those reads which mapped are the ones you want to exclude in the future.
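The contig screening step could be sketched as follows in Python, assuming the database search was run with BLAST and saved in the standard 12-column tabular format (-outfmt 6). The function name and all three cut-offs are hypothetical; the flagged contigs are only candidates and would still need the manual validation described above.

```python
import csv
import io


def contaminated_candidates(handle, max_evalue=1e-20, min_identity=90.0, min_length=200):
    """Return query IDs with at least one hit passing all three cut-offs.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    The default thresholds are placeholder assumptions.
    """
    flagged = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid = row[0]
        pident, length, evalue = float(row[2]), int(row[3]), float(row[10])
        if evalue <= max_evalue and pident >= min_identity and length >= min_length:
            flagged.add(qseqid)
    return flagged


# Made-up hits: one long, near-identical match and one short, weak match.
example = io.StringIO(
    "contig1\tbactA\t98.5\t850\t12\t1\t1\t850\t100\t949\t0.0\t1400\n"
    "contig2\tbactB\t81.0\t60\t11\t0\t5\t64\t10\t69\t2e-05\t52.8\n"
)
print(sorted(contaminated_candidates(example)))  # ['contig1']
```

Requiring a minimum alignment length as well as an e-value keeps short conserved stretches from flagging an otherwise plant-derived contig.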

One last thing: be very careful. More often than not there has been gene transfer between plants and the bacteria living close to them, so some perfectly valid plant reads will probably be wrongly sorted out.

Ram · Answer 4 · 2010-12-14

Though I agree with others that this may not be the best strategy to solve your problem, for fast and efficient computation of k-mers on large sequence databases, try Tallymer from the Kurtz lab. Tallymer is a part of the GenomeTools package, which compiles and runs very cleanly and has a number of other very nice algorithms for genome analysis. Of course this doesn't fully solve your problem, but Tallymer should allow you to quickly generate various k-mer indices and counts as input data for a clustering/classification method.