Question

How to manage a database of sequences

0

7.9 years ago

John_Casey • 0

Hi,

I have a more general question, so I posted it under "Forum".

I need to build a database of reference sequences. These will mainly be potential contaminant sequences (viruses, for example). The idea is to build a contamination detection pipeline on top of this DB for different data types (DNA, RNA, ...).

1. How to keep this database up to date?

My first idea is to write a daily/weekly procedure that checks whether any additional contaminant sequences exist in public databases (NCBI) and adds them if so. These sequences will be used for several purposes: aligning short reads (RNA and DNA) and more classical BLAST alignment. The procedure would build the different aligner indexes (bwa, star, ...) on the fly.
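A minimal Python sketch of the "rebuild only when something changed" part of such a procedure. It assumes the contaminant sequences have already been collected into a single FASTA file; the function names, the checksum file, and the directory layout are hypothetical. The index commands are only built as argument lists here, not executed.

```python
import hashlib
from pathlib import Path


def fasta_checksum(fasta_path):
    """Return the SHA-256 hex digest of a FASTA file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(fasta_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(fasta_path, checksum_path):
    """True if no checksum was recorded yet or the FASTA changed since the last build."""
    checksum_file = Path(checksum_path)
    if not checksum_file.exists():
        return True
    return checksum_file.read_text().strip() != fasta_checksum(fasta_path)


def index_commands(fasta_path, out_dir):
    """Build (but do not run) one index command per tool.

    The "contaminants" prefix and the sub-directory names are assumptions.
    """
    out = Path(out_dir)
    return {
        "bwa": ["bwa", "index", "-p", str(out / "bwa" / "contaminants"),
                str(fasta_path)],
        "star": ["STAR", "--runMode", "genomeGenerate",
                 "--genomeDir", str(out / "star"),
                 "--genomeFastaFiles", str(fasta_path)],
        "blast": ["makeblastdb", "-in", str(fasta_path), "-dbtype", "nucl",
                  "-out", str(out / "blast" / "contaminants")],
    }
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`, with the new checksum written only after all builds succeed.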

It should also be easy to align any sequence against this DB.

As for the format of the database, I'm not sure SQL is a good choice. Maybe a simple XML or JSON file specifying the paths to the different indexes plus metadata (last index build, aligner version, etc.) is enough. Any ideas?
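To make the JSON option concrete, here is a rough sketch of what such a manifest could look like and how it might be updated from Python. The field names (`index_path`, `aligner_version`, `source_fasta`, `last_build`) and both function names are assumptions chosen for illustration, not an established format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def record_index(manifest_path, aligner, index_path, aligner_version, source_fasta):
    """Add or update one aligner entry in the JSON manifest and return the manifest."""
    path = Path(manifest_path)
    # Start from an empty manifest if the file does not exist yet.
    manifest = json.loads(path.read_text()) if path.exists() else {"indexes": {}}
    manifest["indexes"][aligner] = {
        "index_path": str(index_path),
        "aligner_version": aligner_version,
        "source_fasta": str(source_fasta),
        # UTC timestamp of this build, e.g. 2024-01-01T03:00:00+00:00
        "last_build": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n")
    return manifest


def index_path_for(manifest_path, aligner):
    """Look up where the index for a given aligner lives."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest["indexes"][aligner]["index_path"]
```

A pipeline step would then call something like `index_path_for("manifest.json", "bwa")` instead of hard-coding index locations, which keeps the aligner-specific paths in one place.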

2. How to validate the DB?

My first idea was to use negative and positive controls, i.e. a sample with contamination and a sample without contamination, and run the pipeline on them.
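As a sketch of how that control check could be automated, assuming the pipeline reports, for each sample, how many reads it flagged as contaminant out of the total. The two thresholds are placeholders I made up; sensible values depend on the data and would have to be chosen by whoever defines "contamination".

```python
def contaminated_fraction(flagged_reads, total_reads):
    """Fraction of reads the pipeline assigned to a contaminant sequence."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return flagged_reads / total_reads


def validate_controls(positive, negative, min_positive=0.01, max_negative=0.001):
    """Check a positive and a negative control.

    `positive` and `negative` are (flagged_reads, total_reads) tuples.
    The default thresholds are arbitrary assumptions, not recommendations.
    """
    pos = contaminated_fraction(*positive)
    neg = contaminated_fraction(*negative)
    positive_detected = pos >= min_positive
    negative_clean = neg <= max_negative
    return {
        "positive_fraction": pos,
        "negative_fraction": neg,
        "positive_detected": positive_detected,
        "negative_clean": negative_clean,
        "passed": positive_detected and negative_clean,
    }
```

Running this after every database rebuild would catch both a database that no longer detects the spiked-in contaminant and one that starts flagging clean samples.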

Thanks

database • 1.4k views

ADD COMMENT • link updated 7.9 years ago by GenoMax 153k • written 7.9 years ago by John_Casey • 0

1

You could set up a cron job and a bash script to run at an interval of your choice, with permissions of your choice, to wget the public FASTA from NCBI and unzip it if needed. I would recommend using the makeblastdb command to build a new database, automatically replacing the old one in your script once the new database has been downloaded. You could then identify potential contamination by how the database sequences BLAST against your query sequences.
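One way the "replace the old one automatically" step could look, sketched in Python rather than bash. The idea is to build the new database in a fresh directory and only then repoint a `current` symlink, so jobs never see a half-built database. The directory layout, the `current` link name, the `contaminants` prefix, and the function names are all assumptions; the symlink swap relies on POSIX rename semantics.

```python
import os
import subprocess
import tempfile
from pathlib import Path

# Hypothetical crontab entry to run an update script weekly (Sunday, 03:00):
#   0 3 * * 0 /usr/bin/python3 /path/to/update_db.py


def makeblastdb_builder(fasta_path):
    """Return a build function that runs makeblastdb into a given directory."""
    def build(out_dir):
        subprocess.run(
            ["makeblastdb", "-in", str(fasta_path), "-dbtype", "nucl",
             "-out", str(Path(out_dir) / "contaminants")],
            check=True,  # raise if makeblastdb fails
        )
    return build


def build_and_swap(build_fn, db_root, link_name="current"):
    """Build into a new directory, then atomically repoint the live symlink."""
    root = Path(db_root)
    root.mkdir(parents=True, exist_ok=True)
    new_dir = Path(tempfile.mkdtemp(prefix="build_", dir=root))
    build_fn(new_dir)  # on failure this raises and the live link is untouched
    tmp_link = root / (link_name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()  # clean up a stale link from an interrupted run
    os.symlink(new_dir.name, tmp_link)  # relative target inside db_root
    os.replace(tmp_link, root / link_name)  # rename over the old link
    return new_dir
```

With this layout, BLAST jobs would always point at `<db_root>/current/contaminants`. Old `build_*` directories are left in place here and would need separate cleanup.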

ADD REPLY • link 7.9 years ago by st.ph.n ★ 2.7k

1

Only you (or those who want you to build this) can define what "contamination" means. So an automated selection of sequences would likely not be a good idea. Someone will have to vet and choose what goes into this "database".

That said, why does it need to be a database? You could set this up as a "contaminant" genome and build indexes for it with your favorite NGS aligner. BBMap has a tool called bbsplit that can even automatically bin reads into "contaminant" and "right" pools.

magicblast from NCBI could also be used for checking. So you can "reuse" your blast database for regular fasta blast jobs and fastq NGS contaminant checking.
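To illustrate the "reuse" point, a small sketch that builds both command lines against the same BLAST database prefix: `blastn` for FASTA queries and `magicblast` for FASTQ reads. The function names and file names are hypothetical, and the commands are only assembled here, not executed.

```python
def blastn_command(db_prefix, query_fasta, out_tsv):
    """Classic BLAST of FASTA queries, tabular output (-outfmt 6)."""
    return ["blastn", "-db", str(db_prefix), "-query", str(query_fasta),
            "-outfmt", "6", "-out", str(out_tsv)]


def magicblast_command(db_prefix, reads_fastq, out_sam, threads=1):
    """Magic-BLAST of FASTQ reads against the same database (SAM output by default)."""
    return ["magicblast", "-db", str(db_prefix), "-query", str(reads_fastq),
            "-infmt", "fastq", "-out", str(out_sam),
            "-num_threads", str(threads)]


# Both commands point at the same database prefix (path is an assumption).
db = "blast/contaminants"
fasta_job = blastn_command(db, "assembled_contigs.fa", "contig_hits.tsv")
fastq_job = magicblast_command(db, "sample_reads.fastq", "read_hits.sam", threads=4)
```

Either list can be handed to `subprocess.run(cmd, check=True)` once the tools are installed.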

ADD REPLY • link 7.9 years ago by GenoMax 153k

0

You might find this tutorial helpful for setting up local BLAST databases: http://bioinformatics.cvr.ac.uk/blog/setting-up-automatic-blast-database-update-on-linux-servers/

ADD REPLY • link 7.9 years ago by Sej Modha 5.3k