Hi,
I have a more general question, so I posted it under "Forum".
I need to build a database of reference sequences. These will mainly be potential contaminant sequences (viruses, for example). The idea is to build a contamination detection pipeline on top of this DB for different data types (DNA, RNA, ...).
1. How do I keep this database up to date?
My first idea is to write a daily/weekly procedure that checks whether any additional contaminant sequences exist in public databases (NCBI) and adds them if so. These sequences will be used for several purposes: aligning short reads (RNA and DNA) and more classical BLAST alignment. The procedure would build the different aligner indexes (bwa, star, ...) on the fly.
It should also be easy to align any sequence against this DB.
For the format of the database, I'm not sure SQL is a good choice. Maybe a simple XML or JSON file specifying the paths to the different indexes plus metadata (last index build, aligner version, etc.) is enough. Any ideas?
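For illustration, the JSON option could look roughly like the sketch below. The field names (`aligner`, `aligner_version`, `index_path`, `built_at`) and the helper functions are hypothetical and only show one way of recording index paths and build metadata in a single manifest file.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class IndexEntry:
    # All field names are illustrative assumptions, not a standard.
    aligner: str          # e.g. "bwa", "star", "blast"
    aligner_version: str  # version string recorded at build time
    index_path: str       # prefix or directory of the index files
    built_at: str         # ISO 8601 timestamp of the last build


def write_manifest(path, fasta, entries):
    """Write the source FASTA path and all index entries as JSON."""
    doc = {"fasta": str(fasta), "indexes": [asdict(e) for e in entries]}
    Path(path).write_text(json.dumps(doc, indent=2))


def read_manifest(path):
    """Load the manifest and return a dict mapping aligner name to entry."""
    doc = json.loads(Path(path).read_text())
    return {e["aligner"]: IndexEntry(**e) for e in doc["indexes"]}
```

A pipeline step could then call `read_manifest("manifest.json")["bwa"].index_path` to find the index to align against, and the update procedure would rewrite the file after each rebuild.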
2. How do I validate the DB?
My first idea was to use negative and positive controls, i.e. a sample with contamination and a sample without contamination, and run the pipeline on them.
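As a rough sketch of how such controls could be scored, assuming the pipeline reports the IDs of reads it flags as contaminant and the spiked-in contaminant read IDs of the positive control are known (the function name and the inputs are hypothetical):

```python
def control_metrics(flagged, truth, all_reads):
    """Compare flagged read IDs against the known contaminant read IDs.

    flagged:   read IDs the pipeline called contaminant
    truth:     read IDs known to be contaminant (empty for a negative control)
    all_reads: every read ID in the control sample
    """
    flagged, truth, all_reads = set(flagged), set(truth), set(all_reads)
    tp = len(flagged & truth)              # contaminants correctly flagged
    fp = len(flagged - truth)              # clean reads wrongly flagged
    fn = len(truth - flagged)              # contaminants that were missed
    tn = len(all_reads - truth - flagged)  # clean reads left alone
    # A rate is undefined (None) when the control has no reads of that class.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}
```

For the negative control, `truth` is empty, so only the specificity is defined; any flagged read there is a false positive.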
Thanks
You could set up a cronjob and a bash script that runs at an interval of your choice, with permissions of your choice, to wget the public FASTA from NCBI and unzip it if needed. I would recommend using the makeblastdb command to build a new database, automatically replacing the old one in your script once the new database is downloaded. You could then identify potential contamination by how they BLAST against your query sequences.
Only you (or those who want you to build this) can define what "contamination" means, so an automated selection of sequences would likely not be a good idea. Someone will have to vet and choose what goes into this "database".
That said, why does it need to be a database? You could set this up as a "contaminant" genome and build indexes for it with your favorite NGS aligner. BBMap has a tool called bbsplit that can even bin reads for you automatically into "contaminant" and "right" pools.
magicblast from NCBI could also be used for checking, so you can reuse your BLAST database for both regular FASTA BLAST jobs and FASTQ NGS contaminant checking.
You might find this tutorial helpful for setting up local BLAST databases: http://bioinformatics.cvr.ac.uk/blog/setting-up-automatic-blast-database-update-on-linux-servers/
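The cron plus makeblastdb suggestion above could be sketched in Python as follows. This is a minimal sketch under assumptions: the FASTA has already been downloaded (for example by wget), the database is nucleotide, and the checksum stamp file, the function names and the "contaminants" title are all hypothetical. It rebuilds the BLAST database only when the FASTA content has changed.

```python
import hashlib
import subprocess
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def makeblastdb_cmd(fasta, out_prefix):
    """Build the makeblastdb command line (nucleotide DB is an assumption)."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "nucl",
            "-out", str(out_prefix), "-title", "contaminants"]


def rebuild_if_changed(fasta, out_prefix, stamp_file, run=subprocess.run):
    """Rebuild the BLAST DB if the FASTA differs from the last recorded build.

    Returns True if a rebuild was triggered, False if nothing changed.
    The stamp is written only after makeblastdb exits successfully.
    """
    stamp = Path(stamp_file)
    digest = sha256_of(fasta)
    if stamp.exists() and stamp.read_text().strip() == digest:
        return False
    run(makeblastdb_cmd(fasta, out_prefix), check=True)
    stamp.write_text(digest + "\n")
    return True
```

A weekly cron entry could then run a small script that downloads the vetted FASTA and calls `rebuild_if_changed`. The same change check could also gate rebuilding the bwa or STAR indexes.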