Hello everyone, I am trying to implement an expression-based tool to process CSV-like files. It is similar to awk, but faster. I am also trying to apply this tool to FASTA/FASTQ/BED/SAM files, like bioawk does. My plan is to add some simple but efficient functions to support this idea. If you have any suggestions, please feel free to let me know.
filterx
repo: https://github.com/dwpeng/filterx
A simple and lightweight tool for filtering CSV files. It provides a simple way to filter a CSV file by column name and value, and is especially useful for large files when you need multiple filters. It is written in Rust and compiles to a single binary. It is fast and efficient.
Installation
Cargo
cargo install --git https://github.com/dwpeng/filterx
Pre-compiled binary
There is no need to install the Rust toolchain if you just want to use the tool. A pre-compiled binary is available for Linux; download it from the release page.
Quick Start
filterx provides a simple expression language (a subset of the Python language) to filter CSV files. Take the following simple CSV file, data.csv, as an example.
a,b,c
1,"a","apple"
2,"b","banana"
3,"b","dog"
4,"b","cat"
5,"e","elephant"
6,"f","fish"
7,"g","goat"
8,"h","horse"
9,"i","iguana"
10,"j","jaguar"
Filter by column name and value
# -H means the file has a header
filterx data.csv -H "a == 1"
The output is
a,b,c
1,"a","apple"
Filter by multiple conditions
filterx data.csv -H "a > 1 and b == 'b'"
The output is
a,b,c
2,"b","banana"
3,"b","dog"
4,"b","cat"
Filter by a list of values
filterx data.csv -H "b in ('b', 'f')"
The output is
a,b,c
2,"b","banana"
3,"b","dog"
4,"b","cat"
6,"f","fish"
It also supports filtering by a file that contains a list of values. For example, suppose a file filter.txt contains the following values.
b
f
# $1 means the first column in the file
filterx data.csv -H "b in 'filter.txt$1'"
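For illustration, the semantics of this file-based membership filter can be sketched in plain Python (a rough equivalent, not how filterx is implemented; the file contents are inlined here to keep the sketch self-contained):

```python
import csv
import io

# Inline stand-ins for data.csv and filter.txt from the examples above.
data_csv = '''a,b,c
1,"a","apple"
2,"b","banana"
3,"b","dog"
4,"b","cat"
5,"e","elephant"
6,"f","fish"
7,"g","goat"
8,"h","horse"
9,"i","iguana"
10,"j","jaguar"
'''
filter_txt = "b\nf\n"

# 'filter.txt$1' -> collect the first column of filter.txt into a set.
wanted = {line.split()[0] for line in filter_txt.splitlines() if line.strip()}

# b in 'filter.txt$1' -> keep rows whose column b is in that set.
kept = [row for row in csv.DictReader(io.StringIO(data_csv))
        if row["b"] in wanted]
print([row["c"] for row in kept])  # → ['banana', 'dog', 'cat', 'fish']
```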
Column choosing
If a CSV file has a header, you can use the column names directly in expressions. If it does not, you can use the column index instead. For example, the following command filters the file by the first column.
# col(1) means the first column
filterx data.csv "col(1) > 1"
Create a new column
filterx data.csv -H "alias('new') = 1;alias('fk') = a + 5"
The output is
a,b,c,new,fk
1,"a","apple",1,6
2,"b","banana",1,7
3,"b","dog",1,8
4,"b","cat",1,9
5,"e","elephant",1,10
6,"f","fish",1,11
7,"g","goat",1,12
8,"h","horse",1,13
9,"i","iguana",1,14
10,"j","jaguar",1,15
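In plain Python, the two alias() assignments above correspond roughly to the following (an illustrative sketch, not the actual implementation; the input is inlined and truncated to two rows):

```python
import csv
import io

data_csv = 'a,b,c\n1,"a","apple"\n2,"b","banana"\n'

rows = list(csv.DictReader(io.StringIO(data_csv)))
for row in rows:
    row["new"] = 1                 # alias('new') = 1   -> constant column
    row["fk"] = int(row["a"]) + 5  # alias('fk') = a + 5 -> derived column
print(rows[0]["fk"], rows[1]["fk"])  # → 6 7
```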
Select columns to output
filterx data.csv -H "a > 1 and b == 'b';select(a)"
The output is
a
2
3
4
It can also be used to control the order of the output columns.
filterx data.csv -H "a > 1 and b == 'b';select(c, a)"
The output is
c,a
"banana",2
"dog",3
"cat",4
Drop columns
filterx data.csv -H "a > 1 and b == 'b';drop(a)"
The output is
b,c
"b","banana"
"b","dog"
"b","cat"
Expression Language
The expression language is a subset of Python. It supports the following operators and functions.
Operators
==      : equal
!=      : not equal
>       : greater than
>=      : greater than or equal
<       : less than
<=      : less than or equal
and     : logical and
or      : logical or
in      : in a list/file
not in  : not in a list/file
Functions
col(n)         : get the value of the n-th column
alias(c)       : create a column with name c
select(c, ...) : select columns to output; supports multiple columns
drop(c, ...)   : drop columns; supports multiple columns
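Since the expression language is a Python subset, its per-row semantics can be illustrated with Python's own eval (a sketch of the semantics only, not how filterx works internally; the keep helper is hypothetical):

```python
import csv
import io

data_csv = 'a,b,c\n1,"a","apple"\n2,"b","banana"\n3,"b","dog"\n4,"b","cat"\n'

def keep(row, expr):
    """Evaluate a filter expression against one row (hypothetical helper)."""
    # Coerce numeric-looking fields so comparisons like `a > 1` work.
    env = {k: (int(v) if v.lstrip("-").isdigit() else v)
           for k, v in row.items()}
    return eval(expr, {"__builtins__": {}}, env)

rows = [r for r in csv.DictReader(io.StringIO(data_csv))
        if keep(r, "a > 1 and b == 'b'")]
print([r["c"] for r in rows])  # → ['banana', 'dog', 'cat']
```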
Future
I am a graduate student studying bioinformatics, and I regularly need to filter large CSV files. I will add more features to this tool. If you have any suggestions, please let me know.
Thanks for making the effort of writing such a tool!
I can see no harm in having multiple tools doing sort of the same thing... I mean, Novoalign was nice, but I am happy that people didn't stop innovating there ;-) Building it on top of Polars is a clever choice to get memory-efficient and fast processing of the files without reinventing too many wheels.
A feature I suggest adding is an interactive preview mode. As others have pointed out, there are numerous tools that can be used on tabular data for filtering and reformatting, but I do not know any that would allow you to iteratively and interactively build your expressions. You always have to run the command and pipe it to head or something like that to see if it actually did what you wanted. If filterx had the option to be launched on a file, display a text prompt to enter the expression, and, while typing, already show a preview of the first, say, 5 rows of the result, it would have a nice UX. Once you are happy, hit ENTER and the whole file is processed. You could probably reuse some code from TidyViewer for a nicely rendered preview.

PS: If you are looking to automatically build filterx for other platforms as well, feel free to grab the GitHub Actions CI pipeline from umi-transfer.

Thank you so much for your encouragement. To be honest, I haven't received such positive feedback in a long time.
As you mentioned above, an interactive mode helps the user see the result of an expression quickly, which is really helpful. However, there are some other problems, such as filtering based on the average value of a column (currently not supported by filterx): calculating the average requires reading the entire CSV file into memory first, which could introduce new efficiency problems. Over the next few days, I will try to add an interactive mode. A good TUI, similar to the one GDB offers for debugging C programs, could also greatly reduce the user's mental burden.
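The memory concern for column-wide aggregates can often be avoided with a two-pass streaming approach: one pass to compute the aggregate, a second pass to filter. An illustrative Python sketch (not filterx code; the file content is inlined to keep it self-contained):

```python
import csv
import io

data_csv = "a,b\n1,x\n2,y\n3,z\n4,w\n"

# Pass 1: stream once to compute the mean of column a (O(1) memory).
total = n = 0
for row in csv.DictReader(io.StringIO(data_csv)):
    total += float(row["a"])
    n += 1
mean_a = total / n

# Pass 2: stream again, keeping only rows above the mean.
kept = [row["a"] for row in csv.DictReader(io.StringIO(data_csv))
        if float(row["a"]) > mean_a]
print(mean_a, kept)  # → 2.5 ['3', '4']
```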
Thank you for providing me with a GitHub Action. In fact, I also used the cross tool in another project, but I was worried that using musl libc would affect the program's efficiency, so I chose zig-build. However, there are compilation issues on Windows and macOS, so I may use cross instead of zig-build for the next version to provide binaries for Linux/Windows/macOS.
You are more than welcome! It is a well deserved encouragement, because developing (and later maintaining) an open-source tool is tedious, and I am grateful for each of them, since I rely on such tools every day. Furthermore, as far as my limited Rust knowledge allows me to judge, your code is well-structured and cleanly written, so it is also a good codebase.
No rush with implementing the preview mode, but yes, making use of some TUI features could really set filterx apart from other tools. I see the issue with efficiency as well and would address it either by making the feature optional (a subcommand like filterx interactive or a flag like filterx -i) or by disabling certain functionality in interactive mode.

For example, filtering entire columns on computed summary values seems to me to be an edge case, because most tabular data is inherently row-oriented. The information in a particular column may not be interesting for a particular purpose, but then I will just cut it entirely based on column name or number. I think, for most practical filtering purposes, it would be enough to process one row at a time, or just a few rows in a sliding-window manner.

But that is actually a subject a forum like this could easily help you figure out. You could ask people what kind of task they repeatedly struggle with in tools like awk, and then think about how that operation could be simplified using filterx
instead.

I would first of all check that the functionality is novel and not redundant to efficient and established tools like https://bioinf.shenwei.me/seqkit/usage/, samtools, and tabix.
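The row-at-a-time / sliding-window processing suggested a few replies above can be kept memory-bounded with a fixed-size buffer. A small Python sketch, using made-up column names (pos, depth) and an inlined toy file:

```python
import csv
import io
from collections import deque

data_csv = "pos,depth\n1,10\n2,20\n3,30\n4,40\n5,50\n"

size, threshold = 3, 25
window = deque(maxlen=size)  # holds at most `size` rows at any time
kept = []
for row in csv.DictReader(io.StringIO(data_csv)):
    window.append(row)
    if len(window) == size:
        mean = sum(float(r["depth"]) for r in window) / size
        if mean > threshold:
            kept.append(window[1]["pos"])  # center row of the window
print(kept)  # → ['3', '4']
```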
I think you meant to refer to the csvtk toolkit, which is analogous to the tool being described here (also from Wei Shen, the author of seqkit): https://bioinf.shenwei.me/csvtk/

As long as a tool works as advertised, having more options is not a bad thing, even if the functionality is somewhat redundant.
Thank you for your suggestion. My tool is currently just a framework with basic components. I will try to add more functions so that it can meet more filtering requirements, such as filtering sequences by length or GC content.
A lot of text-based bioinformatics files store data with tabs or other delimiters rather than commas. It might be useful to provide options that let the user specify different field and record delimiters, as awk does.
Yes, you are right. This tool supports different delimiters; comma is the default.
You may check DuckDB or xsv. One can do, e.g.:
This is not to discourage you from continuing with your project (an xsv replacement able to handle TSVs could be handy).

Thanks for your example with DuckDB and xsv. I have noticed DuckDB for a while, and I have also used it. To be honest, DuckDB is one of my favorite tools; it can process row-based data with just a simple SQL expression. The reason I want to make a new tool is that those tools are designed to process more standard formats like SQL/CSV/TSV/JSON, and are not designed for bioinformatics. So I want to implement a tool that is more friendly to bioinformatics.