Question

Reference Database For Short Illumina Reads

1

11.7 years ago

kristjan ▴ 170

I am analysing soil samples that have been sequenced with Illumina HiSeq. After assembling the paired-end reads and denoising, the average read length was about 80 bp. So far I have used the Greengenes reference taxonomy for classification (in mothur, with the knn method), but the results are very inaccurate. What reference taxonomy and classification method would you suggest for such short environmental reads?

illumina sequencing classification • 3.1k views

ADD COMMENT • link updated 11.4 years ago by Biostar 20 • written 11.7 years ago by kristjan ▴ 170

score 3 · Answer 1 · 2013-03-13

3

11.7 years ago

vijay ★ 1.6k

Have you targeted the 16S rDNA gene? Which variable region did you sequence? When your reads are shorter than expected, it's imperative to use more than one workflow to check the consistency of the results. You can try workflows such as MG-RAST, MEGAN, and RDP. Comparing them will give you a clearer picture of how reliable your results are.

ADD COMMENT • link 11.7 years ago by vijay ★ 1.6k

1

This is an essential detail: mothur and most other packages were developed for long reads (over 350 bp) targeting the 16S region. Reads of 80 bp are almost certainly not long enough to be used with the same methods.
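
A quick way to see how far a read set falls from that 350 bp figure is to summarise read lengths before picking a classifier. The following is only a minimal sketch, assuming the denoised reads are available as FASTA text; the function names `read_lengths` and `fraction_shorter_than` are illustrative and not part of mothur or any other package.

```python
def read_lengths(fasta_lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # running length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def fraction_shorter_than(lengths, min_len=350):
    """Fraction of reads shorter than min_len (350 bp is the figure quoted above)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < min_len) / len(lengths)


# Toy input standing in for a real file; with a file on disk you could pass
# an open file handle instead of a list of lines.
example = [">read1", "ACGTACGT", "ACGT", ">read2", "A" * 400]
lengths = read_lengths(example)
short_fraction = fraction_shorter_than(lengths)
```

If nearly all reads fall below the threshold, as they would with an 80 bp average, that supports looking for methods built for short reads rather than tuning a long-read workflow.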

ADD REPLY • link 11.7 years ago by Istvan Albert 101k

0

Our target was the V6 region. I have also tried RDP and CREST, but these gave more unclassified reads than mothur. But thanks for the new pipelines. I will try them out.

ADD REPLY • link 11.7 years ago by kristjan ▴ 170

score 0 · Answer 2 · 2013-03-13

Echoing what vijay and Istvan have already said: it's important to use several analysis workflows/pipelines, and within those pipelines I would recommend using several clustering and classification methods to identify your reads (see here for more information). When the classifications agree across methods and pipelines, you are on to something, especially for reads as short as yours.
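
The idea of trusting only calls that agree across pipelines can be sketched in a few lines. This is an illustration under assumptions, not the output format of any particular tool: it assumes each pipeline's results have already been parsed into a `{read_id: taxon}` dictionary at a single taxonomic rank, and the names `consensus_calls`, `min_agree`, and the toy taxa below are hypothetical.

```python
from collections import Counter

UNCLASSIFIED = "unclassified"


def consensus_calls(calls_by_tool, min_agree=2):
    """Keep a taxon for a read only if at least min_agree tools report it.

    calls_by_tool maps tool name -> {read_id: taxon}. Reads missing from a
    tool's output are treated as unclassified by that tool.
    """
    read_ids = set()
    for calls in calls_by_tool.values():
        read_ids.update(calls)

    consensus = {}
    for read_id in sorted(read_ids):
        # Count only real assignments; unclassified calls do not vote.
        votes = Counter(
            calls[read_id]
            for calls in calls_by_tool.values()
            if calls.get(read_id, UNCLASSIFIED) != UNCLASSIFIED
        )
        if not votes:
            consensus[read_id] = UNCLASSIFIED
            continue
        ranked = votes.most_common(2)
        top_taxon, top_count = ranked[0]
        # A tie between the two most common taxa is treated as no consensus.
        tied = len(ranked) > 1 and ranked[1][1] == top_count
        if top_count >= min_agree and not tied:
            consensus[read_id] = top_taxon
        else:
            consensus[read_id] = UNCLASSIFIED
    return consensus


# Toy calls standing in for parsed output from three pipelines.
calls = {
    "mothur": {"r1": "Proteobacteria", "r2": "Firmicutes", "r3": UNCLASSIFIED},
    "rdp": {"r1": "Proteobacteria", "r2": "Actinobacteria"},
    "crest": {"r1": "Proteobacteria", "r2": "Firmicutes", "r3": "Acidobacteria"},
}
result = consensus_calls(calls)
```

Raising `min_agree` makes the consensus stricter; with very short reads it may be worth running this at a coarse rank (phylum or class) first, since agreement tends to drop at finer ranks.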

There are many databases for sequence identification, and the choice of database can drastically change your clustering and classification results. I really like the M5nr database (a collection of other databases with redundant sequences removed) because it's focused on, but not entirely based on, nucleotide sequences coding for proteins. The website for M5nr is located here and the GitHub repository is here.