Reference Database For Short Illumina Reads
11.7 years ago
kristjan ▴ 170

I am analysing soil samples that have been sequenced with Illumina HiSeq. After assembling the paired-end reads and denoising, the average read length was about 80 bp. So far I have used the Greengenes reference taxonomy for classification (in mothur, with the knn method), but the results are very inaccurate. What reference taxonomy and classification method would you suggest for such short environmental reads?

illumina sequencing classification • 3.1k views
11.7 years ago
vijay ★ 1.6k

Have you targeted the 16S rDNA gene? Which variable region have you sequenced? When your read length is shorter than expected, it's imperative to use more than one workflow to check the consistency of the results. You can try various workflows such as MG-RAST, MEGAN, RDP, etc. This would give you a clear picture of how good your results are.


This is an essential detail: mothur and most other packages were developed for long reads (over 350 bp) targeting the 16S region. 80 bp reads are almost certainly not long enough to be used with the same methods.


Our target was the V6 region. I have also tried RDP and CREST, but these gave more unclassified reads than mothur. Thanks for the new pipelines, though; I will try them out.

11.7 years ago
Josh Herr 5.8k

Just echoing what vijay and Istvan have already stated: it's important to use several analysis workflows/pipelines, and within those pipelines I would recommend using several clustering and classification methods to identify your reads (see here for more information). When the classifications agree across methods and pipelines, you are on to something, especially for reads as short as yours.
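
One way to check agreement across pipelines is to keep, for each read, only the leading taxonomic ranks on which every classifier gives the same name. The sketch below is a minimal illustration of that idea, not part of any of the tools mentioned in this thread. It assumes you have already parsed each pipeline's output into a list of rank names per read, ordered from broadest to finest. The function names, the `"unclassified"` placeholder, and the example lineages are hypothetical.

```python
from typing import Dict, List


def consensus_lineage(lineages: List[List[str]]) -> List[str]:
    """Return the leading ranks on which all lineages agree.

    Each lineage is a list of taxon names ordered from broadest to finest
    rank (assumed input format). Comparison stops at the first rank where
    the classifiers disagree, where the shared name is "unclassified"
    (a hypothetical placeholder), or where the shortest lineage ends.
    """
    if not lineages:
        return []
    consensus = []
    # zip stops at the shortest lineage, so missing ranks end the comparison.
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1 or names_at_rank[0] == "unclassified":
            break
        consensus.append(names_at_rank[0])
    return consensus


def consensus_by_read(
    assignments: Dict[str, Dict[str, List[str]]]
) -> Dict[str, List[str]]:
    """Map each read ID to its consensus lineage across all classifiers."""
    return {
        read_id: consensus_lineage(list(per_tool.values()))
        for read_id, per_tool in assignments.items()
    }


# Made-up example data: read ID -> classifier name -> assigned lineage.
example = {
    "read_1": {
        "mothur": ["Bacteria", "Proteobacteria", "Gammaproteobacteria"],
        "rdp": ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
        "crest": ["Bacteria", "Proteobacteria"],
    },
    "read_2": {
        "mothur": ["Bacteria", "Firmicutes"],
        "rdp": ["Bacteria", "unclassified"],
        "crest": ["Archaea"],
    },
}

for read_id, lineage in consensus_by_read(example).items():
    print(read_id, ";".join(lineage) if lineage else "no consensus")
```

Reads that end up with a deep consensus lineage are the ones where the methods agree. Reads with a short or empty consensus are the ones to treat with caution, which for 80 bp reads may be a large share.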

There are lots of databases for sequence identification, and the choice of one can drastically change your clustering and classification results. I really like the M5nr database (which is just a collection of other databases with redundant sequences removed) because it's focused on, but not entirely based on, nucleotide sequences coding for proteins. The website for M5nr is located here and the GitHub repository is here.
