Say I have a million 454 reads generated from microbes residing in the gut of a previously unknown (unsequenced) insect. There is a high likelihood that sequences originating from insect DNA have contaminated the sequenced data set. How do I identify and remove these sequences? Please don't suggest BLAST; I am looking for approaches that do not need high compute power.
It's possible that there is contamination, but it's hardly relevant, because these reads might not align to a microbial reference database anyway. I don't understand why you cannot BLAST, because you will have to do that anyway. By the way, a computer that can carry out a BLAST of so few 454 reads is not that expensive. If you are not able to BLAST your reads, you simply don't have the compute power your analysis requires, and the best recommendation is to acquire it first.
Hey, this is akin to saying "BLAST and MEGAN are the last and only resort". I was expecting this question to spawn a few novel ideas that help researchers in resource-poor settings. In other words, metagenomics analysis currently seems restricted to groups that can afford huge compute resources. I guess people should be developing alignment-free approaches that do not need an all-vs-all BLAST. This should be possible in some manner.
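To make "alignment-free" a bit more concrete, here is a minimal sketch of one such idea: comparing reads by their k-mer composition instead of aligning them. It is only an illustration and rests on assumptions that are not part of this thread: that you can build two reference composition profiles (for example from reads you already trust to be microbial and from sequence of a related insect), and that tetranucleotide composition (K = 4) differs enough between the two to be informative on reads as short as 454 reads, which is not guaranteed. The names `kmer_profile`, `gc_content`, and `closer_to` are hypothetical helpers, not functions from any existing tool.

```python
from collections import Counter
from itertools import product
import math

K = 4  # tetranucleotides; an illustrative choice, not a recommendation
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq):
    """Return normalized K-mer frequencies, ordered like KMERS.

    K-mers containing anything other than A, C, G, T (e.g. N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts[km] for km in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[km] / total for km in KMERS]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def distance(p, q):
    """Euclidean distance between two composition profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def closer_to(read, profile_a, profile_b):
    """Return 'a' or 'b' depending on which reference profile is nearer."""
    p = kmer_profile(read)
    return "a" if distance(p, profile_a) <= distance(p, profile_b) else "b"


if __name__ == "__main__":
    # Toy sequences only; real profiles would be built from many reads.
    at_rich = kmer_profile("ATATATATAT")
    gc_rich = kmer_profile("GCGCGCGCGC")
    print(closer_to("TATATATATA", at_rich, gc_rich))
    print(gc_content("ATGC"))
```

Each read is processed in a single pass and needs only a 256-element vector, so memory and CPU cost grow linearly with the number of reads and no all-vs-all comparison is involved. Treat the assignments as a coarse pre-filter at best: a read that sits near the boundary between the two profiles should not be discarded on this evidence alone.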