Sequence filtration Program
Sequence database curator
https://github.com/Eslam-Samir-Ragab/Sequence-database-curator
This program filters a nucleotide or protein database against a list of names or sequences (by exact match).
Input:
File containing all the sequences in FASTA format.
Processing:
It removes the sequences you specify from your database.
Output:
One file with the name you choose.
Options:
Working on either protein (p) or nucleotide (n) databases.
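The processing step described above can be sketched in a few lines of plain Python. This is not the program's actual source; read_fasta and filter_records are hypothetical helper names, and the sketch assumes that a "name" is the full header line after ">" and that sequences are compared exactly as written.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_records(records, unwanted, mode="name"):
    """Keep only records whose name (or sequence) is NOT in `unwanted`.

    mode="name" compares names, mode="seq" compares sequences.
    Both are exact matches (assumption: no case folding or trimming).
    """
    unwanted = set(unwanted)
    if mode == "seq":
        return [(n, s) for n, s in records if s not in unwanted]
    return [(n, s) for n, s in records if n not in unwanted]


# Tiny in-memory database; seq1 and seq3 share the same sequence.
fasta_lines = [">seq1", "ACGT", ">seq2", "GG", "CC", ">seq3", "ACGT"]
records = list(read_fasta(fasta_lines))

kept_by_name = filter_records(records, ["seq2"], mode="name")
kept_by_seq = filter_records(records, ["ACGT"], mode="seq")
print([n for n, _ in kept_by_name])
print([n for n, _ in kept_by_seq])
```

Using a set for the filter list keeps each lookup constant-time, which matters when the database is large.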
How to use:
- You need to install Python 2.7 or Python 3 on your machine.
- You need to install NumPy and Biopython.
- You need to install the future module with pip.
- Click “Clone or download” > “Download ZIP” > extract the downloaded file.
- Run the file “sequence_filteration.py” with Python:
- Windows : open the file with python.exe
- Unix/Linux : use the command
chmod u+x sequence_filteration.py
- Mac : use the command
python sequence_filteration.py
- State your variables and press Enter.
The program's options are summarized in the README file.
Examples
To process nucleotide sequences, use the following command:
python sequence_filteration.py -in (input_file) -n -out (output_file) -filter (filter_file) -flt_mode seq
To process protein sequences with the optimum length approach, use the following command:
python sequence_filteration.py -in (input_file) -p -out (output_file) -filter (filter_file) -flt_mode seq
To process nucleotide sequences with a file containing a list of exact names, use the following command:
python sequence_filteration.py -in (input_file) -n -out (output_file) -filter (filter_file) -flt_mode name
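The flags used in the commands above could be declared with argparse so that running the script with -h prints what each one means. The following is only a sketch under assumptions, not the script's real argument handling: the help texts, the restriction of -flt_mode to seq and name, and the file names db.fasta, kept.fasta and names.txt are all hypothetical.

```python
import argparse


def build_parser():
    """Declare the flags shown in the examples, each with a help string."""
    parser = argparse.ArgumentParser(
        prog="sequence_filteration.py",
        description="Remove listed sequences from a FASTA database (exact match).",
        epilog="example: python sequence_filteration.py -in db.fasta -n "
               "-out kept.fasta -filter names.txt -flt_mode name",
    )
    # "in" is a Python keyword, so an explicit dest is needed.
    parser.add_argument("-in", dest="input_file", required=True,
                        help="input file: the database in FASTA format")
    parser.add_argument("-out", dest="output_file", required=True,
                        help="output file: the filtered database")
    parser.add_argument("-filter", dest="filter_file", required=True,
                        help="input file: names or sequences to remove")
    parser.add_argument("-flt_mode", choices=["seq", "name"], required=True,
                        help="match the filter list against sequences or names")
    # Exactly one of -n / -p must be given.
    kind = parser.add_mutually_exclusive_group(required=True)
    kind.add_argument("-n", dest="db_type", action="store_const", const="n",
                      help="nucleotide database")
    kind.add_argument("-p", dest="db_type", action="store_const", const="p",
                      help="protein database")
    return parser


# Parse the third example command with hypothetical file names.
args = build_parser().parse_args(
    ["-in", "db.fasta", "-n", "-out", "kept.fasta",
     "-filter", "names.txt", "-flt_mode", "name"]
)
print(args.input_file, args.db_type, args.flt_mode)
```

Writing the help text next to each flag keeps the documentation and the feature set in one place, so they are less likely to drift apart.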
Related:
While this is a basic tool, you have made huge improvements in your coding style and layout in less than a month. I'm seriously deeply impressed. I hope you're enjoying it too because there's probably still a long way to go. But yeah, you obviously have a lot of passion for this, and like I said your coding style is great. Please keep it up! :)
My only advice would be to try and find a more interesting problem to solve for your next one. I don't know what your situation at work is like, but if you can find new problems to solve by asking the people around you what tools they would want in an ideal world, then that could be the perfect seed for an idea for your next tool :)
@Eslam, you did a great job of describing the installation and usage of your tool, complete with a diagram. I would suggest, though, that the diagram seems a bit unrelated to names, which is a little confusing. Also, it would help if you could describe in more detail the meaning of the flags - for example, does "-filter" specify an output file or an input file? Formatting-wise, this is an excellent post.
I suggest you modify the tool slightly to allow users to invert the selection and filter either inclusively or exclusively of the list, or else, provide two output streams, one for sequences matching the filter, and one for sequences not matching the filter. I find that in practice splitting based on name tends to be useful.
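A minimal sketch of the two-output-stream idea, assuming the records are already parsed into (name, sequence) pairs. split_by_names and to_fasta are hypothetical helpers and not part of the tool:

```python
def split_by_names(records, names):
    """Split (name, sequence) pairs into (matched, unmatched) lists."""
    names = set(names)
    matched, unmatched = [], []
    for name, sequence in records:
        if name in names:
            matched.append((name, sequence))
        else:
            unmatched.append((name, sequence))
    return matched, unmatched


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text, one line per sequence."""
    return "".join(">{}\n{}\n".format(name, seq) for name, seq in records)


records = [("seq1", "ACGT"), ("seq2", "GGCC"), ("seq3", "TTAA")]
matched, unmatched = split_by_names(records, ["seq2"])

# Each stream could be written to its own file; here they are just printed.
print(to_fasta(matched))
print(to_fasta(unmatched))
```

With both lists available, inverting the selection is just a matter of choosing which one to write out.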
Dr @Brian, thanks for pointing out some of the defects of the previous tool. Here is the next release, the Sequence Dereplication and Database Curator Program (SDDC). I hope it does this job well enough that I can move on to more realistic problems :)
Hi Eslam,
I recommend that you update your original post - right now it says:
To process nucleotide sequences with a file containing a list of exact names, use the following command:
...which is great - I love command lines that are easy to follow, rather than regexes and similar. For example:
So, that's a lot of information. I still have no idea what "grep [OPTION]... PATTERN [FILE]..." means, because it's some custom language understood by the developer, and not the users. Why are there ellipses? What do the brackets mean? Why does OPTION have brackets but not PATTERN? Basically, with this kind of help, without any examples of the kind of things a typical user might type, it is impossible to use without extensive trial and error. Which basically negates the purpose of documentation.
So - to make your tool friendly, I suggest you keep the documentation up to date with the current feature-set, and also write documentation such that anyone who reads it will be able to use the tool correctly - with a goal of using it correctly on the first try, even for people that are not Linux-proficient.
Note - I deleted random lines from grep's output to fit biostars' line limit.
Dr @John, thanks a lot for all of this precious advice. I'm now working on something that I hope will have a better impact than this simple idea :)