Question

How to reduce database size for USEARCH

0

20 months ago

lyehui1 • 0

Hello,

I am new to bioinformatics and trying to replicate a 16S rRNA analysis study. I am currently stuck at a section where the authors used UCHIME v4.2 to identify and remove chimeric sequences.

I have the following files intended as input:

Reference database file (.fasta, ~800 MB) downloaded from SILVA
Reads in the form of fastq.gz files (14-18 MB each)

I am using the 32-bit version of USEARCH, and there isn't enough memory to run chimera identification for any of the files.

The USEARCH website recommends reducing memory use by shrinking the database, either by clustering or by splitting it. How do I go about doing this? Or are there other potential issues with my input files?

USEARCH • 728 views

updated 20 months ago by Darked89 4.7k • written 20 months ago by lyehui1 • 0

1

Clustering or removing redundancy can be done using CD-HIT (LINK). Start there.

If you are trying to replicate a study, be aware that doing something like this (if it was not done in the original study) is bound to keep you from reproducing the original results exactly.
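
As a rough illustration of what "removing redundancy" means in its simplest form, the sketch below drops exact duplicate sequences from a FASTA file. It is not CD-HIT and does no similarity clustering; it only collapses entries whose sequences are identical, so the size reduction will be smaller. The function names and file paths are hypothetical, and the code assumes the whole set of distinct sequences fits in memory.

```python
def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Keep the first header seen for each distinct sequence.

    Sequences are compared case-insensitively and returned in upper case.
    """
    first_header = {}
    for header, seq in records:
        key = seq.upper()
        if key not in first_header:
            first_header[key] = header
    return [(header, seq) for seq, header in first_header.items()]


def dereplicate_file(in_path, out_path):
    """Write a copy of in_path with exact duplicate sequences removed.

    Both paths are placeholders chosen by the caller.
    """
    with open(in_path) as src:
        unique = dereplicate(read_fasta(src))
    with open(out_path, "w") as dst:
        for header, seq in unique:
            dst.write(f">{header}\n{seq}\n")
    return len(unique)
```

A call such as `dereplicate_file("db.fasta", "db.derep.fasta")` (both file names are placeholders) returns the number of sequences kept. The same caveat applies here: a database altered this way is no longer the one used in the original study.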

20 months ago by GenoMax 147k

score 1 · Answer 1 · 2023-04-04

1

20 months ago

Darked89 4.7k

You can use vsearch, a free 64-bit reimplementation of USEARCH: https://github.com/torognes/vsearch

Depending on how many duplicated reads you have in your FASTQs, it may be worth collapsing identical reads using clumpify from BBMap.
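
To judge whether collapsing is worthwhile, one could first count how many reads are exact duplicates. The sketch below is only an illustration of that check, not a substitute for clumpify: it assumes standard four-line FASTQ records, compares sequences only (ignoring qualities), and uses hypothetical function names.

```python
import gzip
from collections import Counter


def count_identical_reads(fastq_gz_path):
    """Count occurrences of each distinct read sequence in a gzipped FASTQ.

    Assumes four lines per record, with the sequence on the second line.
    """
    counts = Counter()
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                counts[line.strip()] += 1
    return counts


def duplication_summary(counts):
    """Return (total reads, distinct sequences) for a Counter of reads."""
    return sum(counts.values()), len(counts)
```

If the number of distinct sequences is much smaller than the total read count, collapsing identical reads should shrink the input noticeably.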