Sequence Dereplication and Database Curator Program (SDDC)
https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/releases/tag/v2.0
This program dereplicates and/or filters nucleotide and/or protein databases using a list of names or sequences (by exact match).
Version 2.0 Updates:
- You can filter the sequences using only keywords (separated by commas), inclusively or exclusively, by adding the (-kw) argument to your normal command line.
- You can get your sequences in their original order after dereplication and/or sequence filtration by adding the (-org_order) argument to your normal command line.
How to use:
- You need to install Python 2.7 or Python 3 on your machine.
- You need to install NumPy and Biopython.
- You need to install the future module using pip.
- Click “Clone or download” > “Download ZIP” > extract the downloaded file.
- Windows : open the file “sddc.py” with (python.exe).
- U/Linux : use the command
chmod u+x sddc.py
- Mac : use the command
python sddc.py
- State your variables and press Enter.
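Before running the program, the prerequisites listed above can be checked with a few lines of Python. This is only a minimal sketch: the helper name missing_modules is hypothetical, it assumes Python 3.4 or later (for importlib.util.find_spec), and it relies on the usual import names (numpy for NumPy, Bio for Biopython, future for the future module).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names: NumPy -> numpy, Biopython -> Bio, future -> future
required = ["numpy", "Bio", "future"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

Any module reported as missing would need to be installed before sddc.py is started.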
The list of program options can be downloaded from here:
Examples
If you want to dereplicate protein sequences, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep
If you want to dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order
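To illustrate what exact-match dereplication with preserved order means, the sketch below keeps the first occurrence of every distinct sequence and drops later identical copies. It is a minimal pure-Python illustration, not SDDC's actual code: the helper names parse_fasta and dereplicate are hypothetical, only fully identical sequences are treated as duplicates, and ignoring letter case during comparison is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def dereplicate(records):
    """Keep the first record for each distinct sequence, in input order."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # assumption: case is ignored when comparing
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


example = """>seq1
MKTAYIAK
>seq2
MKVLA
>seq3
MKTAYIAK
"""
unique_records = dereplicate(parse_fasta(example))
print(unique_records)
```

Here seq3 is dropped because its sequence is identical to seq1, and the remaining records stay in their input order.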
If you want to dereplicate protein sequences with a minimum length of 30 when the sequences are in multiple files, use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi
If you want to dereplicate nucleotide sequences with the optimum approach and a normal protein length of 300, use the following command:
python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300
If you want to filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive
If you want to filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords you have specified, separated by commas), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw
If you want to filter protein sequences exclusively by name (i.e. you want to retrieve the sequences whose names aren't present in your filter file), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive
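The difference between the inclusive and exclusive approaches when filtering by name can be sketched as follows. This is an illustration under stated assumptions rather than SDDC's own implementation: the function filter_by_name is hypothetical, records are plain (name, sequence) tuples, and a name is assumed to match only when it is exactly equal to an entry in the filter list.

```python
def filter_by_name(records, names, approach="inclusive"):
    """Filter (name, sequence) tuples by exact name match.

    inclusive: keep only records whose name is in `names`.
    exclusive: keep only records whose name is NOT in `names`.
    """
    wanted = set(names)
    if approach == "inclusive":
        return [(n, s) for n, s in records if n in wanted]
    if approach == "exclusive":
        return [(n, s) for n, s in records if n not in wanted]
    raise ValueError("approach must be 'inclusive' or 'exclusive'")


records = [("protA", "MKT"), ("protB", "MKV"), ("protC", "MAA")]
filter_names = ["protA", "protC"]

kept_inclusive = filter_by_name(records, filter_names, "inclusive")
kept_exclusive = filter_by_name(records, filter_names, "exclusive")
print(kept_inclusive)
print(kept_exclusive)
```

With the same filter list, the two approaches return complementary sets of records.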
If you want to filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names don't contain the keywords, separated by commas, listed in your filter file), use the following command:
python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw
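Keyword filtering can be pictured as a substring test on each sequence name. The sketch below is a hedged illustration, not the program's code: parse_keywords and filter_by_keywords are hypothetical helpers, the keywords are assumed to arrive as one comma-separated line, a name is assumed to match if it contains any one of the keywords, and the case-insensitive comparison is also an assumption.

```python
def parse_keywords(line):
    """Split a comma-separated line into a list of non-empty keywords."""
    return [kw.strip() for kw in line.split(",") if kw.strip()]


def filter_by_keywords(records, keywords, approach="inclusive"):
    """Filter (name, sequence) tuples by keywords found in the name."""
    lowered = [kw.lower() for kw in keywords]

    def has_keyword(name):
        # assumption: any single keyword is enough, case is ignored
        return any(kw in name.lower() for kw in lowered)

    if approach == "inclusive":
        return [(n, s) for n, s in records if has_keyword(n)]
    if approach == "exclusive":
        return [(n, s) for n, s in records if not has_keyword(n)]
    raise ValueError("approach must be 'inclusive' or 'exclusive'")


records = [
    ("P1 DNA polymerase", "MKT"),
    ("P2 RNA helicase", "MKV"),
    ("P3 hypothetical protein", "MAA"),
]
keywords = parse_keywords("polymerase, Helicase")

with_keywords = filter_by_keywords(records, keywords, "inclusive")
without_keywords = filter_by_keywords(records, keywords, "exclusive")
print(with_keywords)
print(without_keywords)
```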
If you want to filter nucleotide sequences by sequence (exclusive only), use the following command:
python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)
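Exclusive filtering by sequence amounts to removing every record whose sequence exactly matches one of the sequences in the filter file. The following is a minimal sketch of that idea only: exclude_by_sequence is a hypothetical helper, records are (name, sequence) tuples, and ignoring letter case is an assumption.

```python
def exclude_by_sequence(records, filter_sequences):
    """Drop records whose sequence exactly matches a filter sequence."""
    unwanted = {seq.upper() for seq in filter_sequences}  # assumption: case ignored
    return [(n, s) for n, s in records if s.upper() not in unwanted]


records = [("gene1", "ATGCCGTA"), ("gene2", "ATGAAATAG"), ("gene3", "ATGCCGTA")]
filter_sequences = ["atgccgta"]

remaining = exclude_by_sequence(records, filter_sequences)
print(remaining)
```

Both gene1 and gene3 are removed here because they carry the filtered sequence, while a sequence that only contains it as a fragment would be kept by this sketch.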
Thank you so much for your comment.
Regarding the option of counting sequences, the count can be obtained, but only indirectly. Please check the SDDC Cheat sheet; this point is covered on page 2 (grey box).
For the partial (fragmented) sequences, I tested the program against a synthetic database and a real database, for both protein and nucleotide sequences, including partial (fragmented) sequences, and the sensitivity and specificity were both 100%.
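For readers who want to run a similar check, sensitivity and specificity could be computed along the lines of the sketch below. It is only an illustration and assumes Python 3: the function sensitivity_specificity is hypothetical, and treating a "positive" as a sequence that should be removed is an assumption, not necessarily the definition used in the test described above.

```python
def sensitivity_specificity(all_ids, expected_removed, actually_removed):
    """Return (sensitivity, specificity) for a removal test.

    A 'positive' is a sequence that should be removed (assumption).
    """
    all_ids = set(all_ids)
    expected = set(expected_removed)
    actual = set(actually_removed)

    tp = len(expected & actual)            # correctly removed
    fn = len(expected - actual)            # should have been removed, was kept
    fp = len(actual - expected)            # removed, should have been kept
    tn = len(all_ids - expected - actual)  # correctly kept

    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


ids = ["s1", "s2", "s3", "s4", "s5"]
result = sensitivity_specificity(ids, {"s3", "s5"}, {"s3", "s5"})
print(result)
```

A result of (1.0, 1.0) corresponds to 100% sensitivity and 100% specificity.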
Thanks again for your comment. I would be glad to know if anything else is needed to make this tool more convenient to use.