Metagenomic classification by custom database
6.3 years ago

Hi,

I'm looking for a metagenomic classifier that will classify paired-end Illumina reads using a custom FASTA file as the database. I have looked at a number of them, and they are all very concerned with taxonomy, which is not relevant, or even available, in this particular case (Centrifuge is doing a fine job of that).

Ideally the output is a table of abundance for each entry in the database.

Any such thing?

sequencing classification • 2.7k views
6.3 years ago
Carambakaracho ★ 3.3k

It sounds to me like all you need is some super-fast BLAST-like functionality. Unless I have misinterpreted your request, try either Benjamin Buchfink's DIAMOND (against a protein database) or NCBI Magic-BLAST (against a DNA database).
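For the DNA case, a rough sketch of how the two command lines might be assembled is shown below. The file names, the database name `custom_db`, and the helper `magicblast_commands` are made up for illustration; the flags are the ones I recall from the Magic-BLAST and BLAST+ documentation, so check them against your installed version. As far as I know, `-parse_seqids` is what keeps the original FASTA identifiers in the output.

```python
def magicblast_commands(fasta, reads_1, reads_2,
                        db_name="custom_db", out="hits.tsv"):
    """Return (makeblastdb_cmd, magicblast_cmd) as argument lists.

    All names here are hypothetical placeholders; nothing is executed.
    """
    # Build a nucleotide BLAST database from the custom FASTA file.
    build_db = ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
                "-parse_seqids", "-out", db_name]
    # Align paired-end FASTQ reads and ask for tab-separated output.
    align = ["magicblast", "-query", reads_1, "-query_mate", reads_2,
             "-db", db_name, "-infmt", "fastq", "-outfmt", "tabular",
             "-out", out]
    return build_db, align


build_db, align = magicblast_commands("refs.fa", "R1.fastq", "R2.fastq")
print(" ".join(build_db))
print(" ".join(align))
```

With both tools installed, each list could be passed to `subprocess.run(cmd, check=True)`.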

The abundance table then is a simple script using hash/dictionaries or something similar.
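A minimal sketch of such a script, assuming tab-separated hits with the read ID in column 1 and the database entry ID in column 2, comment lines starting with `#`, and `-` in column 2 for unaligned reads (this is how I understand the Magic-BLAST tabular output, but verify it on your own files). The function name `abundance_table` is just illustrative.

```python
from collections import defaultdict


def abundance_table(lines):
    """Count distinct read IDs assigned to each database entry.

    Assumes tab-separated lines: column 1 = read ID, column 2 = reference ID.
    """
    reads_per_ref = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        read_id, ref_id = fields[0], fields[1]
        if ref_id == "-":
            continue  # assumed marker for an unaligned read
        # A set ensures each read is counted once per reference, even if
        # it has several alignments (or both mates share the same ID).
        reads_per_ref[ref_id].add(read_id)
    return {ref: len(reads) for ref, reads in reads_per_ref.items()}


example_hits = [
    "# example header line",
    "read1\tseqA\t99.0",
    "read1\tseqA\t98.5",
    "read2\tseqA\t97.0",
    "read3\tseqB\t100.0",
    "read4\t-\t0",
]
for ref, count in sorted(abundance_table(example_hits).items()):
    print(f"{ref}\t{count}")
```

For a real run, pass an open file handle instead of the example list. Reads that hit several entries are counted for each of them here; whether that is what you want depends on how similar your database entries are.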


Magic-BLAST seems to be what I'm looking for, thanks a lot!


Hi Mikael, FYI, general good practice is to mark helpful answers with a thumbs up and, if an answer solved your problem, to mark it as the accepted answer, so others can spot the best solutions faster.

6.3 years ago

You could try Kaiju. Check its GitHub page and read the section on custom databases.


Hi Vijay,

I looked at Kaiju, but it requires NCBI taxon identifiers for custom databases, which I don't have.


You can find all sorts of downloads related to NCBI Taxonomy here.


Hi genomax,

The issue is that I have no taxonomy in my own database; the sequences potentially have arbitrary and anonymous headers.


Hey, I just checked it again and it says that it doesn't need the taxonomic classification. So maybe you can give it a try.


I'm not sure; the documentation says:

It is also possible to make a custom database from a collection of protein sequences. The format needs to be a FASTA file in which the headers are the numeric NCBI taxon identifiers of the protein sequences, which can optionally be prefixed by another identifier (e.g. a counter) followed by an underscore, for example:

Am I misunderstanding something?
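To make the quoted requirement concrete, here is a small sketch that lists FASTA headers which do not fit the described pattern (a numeric ID, optionally preceded by some other identifier and an underscore). The regular expression is only my reading of that sentence, not something taken from Kaiju itself, and `nonconforming_headers` is a made-up helper name.

```python
import re

# Optional "<identifier>_" prefix followed by a purely numeric ID.
# This pattern is an interpretation of the quoted text, not Kaiju's own check.
HEADER_PATTERN = re.compile(r"(?:\S+_)?\d+")


def nonconforming_headers(fasta_lines):
    """Return the FASTA headers that do not match the pattern above."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if not HEADER_PATTERN.fullmatch(header):
                bad.append(header)
    return bad


example_fasta = [
    ">1_9606", "MKV",
    ">9606", "MKL",
    ">contig_17_unknown", "MAA",
]
print(nonconforming_headers(example_fasta))
```

Arbitrary or anonymous headers like the last one would be reported, which is the situation described above.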
