I am analysing soil samples that were sequenced on an Illumina HiSeq. After assembling the paired-end reads and denoising, the average read length was about 80 bp. So far I have used the Greengenes reference taxonomy for classification (in mothur, with the knn method), but the results are very inaccurate. What reference taxonomy and classification method would you suggest for such short environmental reads?
Have you targeted the 16S rDNA gene? Which variable region have you sequenced? When your reads are shorter than expected, it's imperative to run more than one workflow and check that the results are consistent. You can try workflows such as MG-RAST, MEGAN, and RDP. This will give you a clearer picture of how reliable your results are.
This is an essential detail: mothur and most other packages were developed for long reads (over 350 bp) targeting the 16S region. 80 bp reads are almost certainly not long enough to be used with the same methods.
Our target was the V6 region. I have also tried RDP and CREST, but these gave more unclassified reads than mothur. But thanks for the new pipelines. I will try them out.
To echo what vijay and Istvan have already said, it's important to use several analysis workflows/pipelines, and within those pipelines I would recommend trying several clustering and classification methods to identify your reads (see here for more information). When the classifications agree across methods and pipelines, you are on to something, especially for reads as short as yours.
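One way to check that agreement is sketched below in Python. This is only a minimal illustration, not part of any of the pipelines mentioned: it assumes you have already parsed each pipeline's output into a mapping from read ID to a lineage list (domain, phylum, class, ...), and the function names, pipeline labels and example lineages are all hypothetical. For each read it keeps the ranks on which every pipeline agrees and stops at the first disagreement or unclassified rank.

```python
from typing import Dict, List

UNCLASSIFIED = "unclassified"


def consensus_lineage(lineages: List[List[str]]) -> List[str]:
    """Return the ranks shared by all lineages, from the top rank down.

    Stops at the first rank where the lineages disagree or where the
    assignment is 'unclassified'.
    """
    consensus = []
    # zip() stops at the shortest lineage, so missing deep ranks end the walk.
    for names in zip(*lineages):
        first = names[0]
        if first.lower() == UNCLASSIFIED or any(n != first for n in names):
            break
        consensus.append(first)
    return consensus


def consensus_by_read(
    assignments: Dict[str, Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Compute the consensus lineage for every read seen by all pipelines.

    `assignments` maps a pipeline label to {read ID: lineage list}.
    Reads missing from any pipeline are skipped.
    """
    pipelines = list(assignments.values())
    if not pipelines:
        return {}
    shared = set(pipelines[0])
    for pipeline in pipelines[1:]:
        shared &= set(pipeline)
    return {
        read_id: consensus_lineage([p[read_id] for p in pipelines])
        for read_id in sorted(shared)
    }


# Hypothetical, hand-written assignments purely for illustration.
example = {
    "pipeline_a": {
        "read1": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
        "read2": ["Bacteria", "Firmicutes"],
    },
    "pipeline_b": {
        "read1": ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
        "read2": ["Bacteria", "unclassified"],
    },
}

for read_id, lineage in consensus_by_read(example).items():
    print(read_id, ";".join(lineage))
```

Reads whose consensus stops at domain or phylum level are the ones where the short read length is probably limiting the classification, whichever method you use.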
There are lots of databases for sequence identification, and the choice of database can drastically change your clustering and classification results. I really like the M5nr database (which is just a collection of other databases with redundant sequences removed) because it's focused on, but not entirely based on, nucleotide sequences coding for proteins. The website for M5nr is located here and the GitHub repository is here.