Hello,
I am new to bioinformatics and trying to replicate a 16S rRNA analysis study. I am currently stuck at a section where the authors used UCHIME v4.2 to identify and remove chimeric sequences.
I have the following files intended as input:
- Reference database file (.fasta, ~800 MB) downloaded from SILVA
- Reads in the form of fastq.gz files (ranging from 14 to 18 MB each)
I am using the 32-bit version of USEARCH, and there isn't enough memory to run chimera identification for any of the files.
The USEARCH website recommends reducing memory use by shrinking the database, either by clustering or by splitting it. How do I go about doing this? Or are there other potential issues with my input files?
Clustering or removing redundancy can be done using CD-HIT (LINK). Start there.
If you are trying to replicate a study, though, doing something like this (if it was not done in the original study) is bound to leave you unable to reproduce the original results.
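For the splitting route mentioned in the question, here is a minimal Python sketch that cuts a FASTA file into smaller files with a fixed number of records each. The function names, the `chunk_NNN.fasta` naming scheme, and the chunk size are assumptions for illustration only; nothing here comes from USEARCH or the original study. Note that a chimera check against one chunk only sees part of the reference, so results may differ from a run against the full database.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def split_fasta(fasta_path, out_dir, records_per_chunk):
    """Write consecutive groups of FASTA records to numbered chunk files.

    Hypothetical helper: returns the list of chunk paths it wrote.
    Any lines before the first '>' header are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    n_in_chunk = 0
    with open_text(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Start a new chunk file when none is open or the current one is full.
                if out is None or n_in_chunk == records_per_chunk:
                    if out is not None:
                        out.close()
                    chunk_path = out_dir / f"chunk_{len(chunk_paths) + 1:03d}.fasta"
                    chunk_paths.append(chunk_path)
                    out = open(chunk_path, "w")
                    n_in_chunk = 0
                n_in_chunk += 1
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

A call such as `split_fasta("silva.fasta", "silva_chunks", 50000)` (file name and chunk size are placeholders) would produce files you can pass to the chimera check one at a time. Pick the chunk size by trial so that each file fits in the memory available to the 32-bit build.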