Question

Tool:A simple and lightweight tool for filtering csv file wirtten in Rust

2

Entering edit mode

8 months ago

dwpeng ▴ 120

hello everyone, I am trying to implement a expression-based tool to process csv-like files. It is likely awk, but it's more faster. And I also trying to apply this tool to process fasta/fastq/bed/sam like bioawk. My plan is to add some simple but efficient functions to support my idea. If you have any suggestion, please free let me know.

filterx

repo: https://github.com/dwpeng/filterx

A simple and lightweight tool for filtering csv file. It provides a simple way to filter csv file by column name and value. It is useful for filtering large csv files expescially when you need multiple filters. It is written in Rust and compiled to a single binary file. It is fast and efficient.

Installation

Cargo

cargo install --git https://github.com/dwpeng/filterx

Pre-compiled binary

There is no need to install Rust toolchain if you just want to use the tool. The pre-compiled binary is available for Linux. Download the pre-compiled binary from the release page.

Quick Start

it provides a simple expression language (like a subset of python programe language) to filter the csv file. There is a simple csv file data.csv as an example.

a,b,c
1,"a","apple"
2,"b","banana"
3,"b","dog"
4,"b","cat"
5,"e","elephant"
6,"f","fish"
7,"g","goat"
8,"h","horse"
9,"i","iguana"
10,"j","jaguar"

Filter by column name and value

# -H means the file has a header
filterx data.csv -H "a == 1"

The output is

a,b,c
1,"a","apple"

Filter by multiple conditions

filterx data.csv -H "a > 1 and b == 'b'"

The output is

a,b,c
2,"b","banana"
3,"b","dog"
4,"b","cat"

Filter by a list of values

filterx data.csv -H "b in ('b', 'f')"

The output is

a,b,c
2,"b","banana"
3,"b","dog"
4,"b","cat"
6,"f","fish"

it also supports filter by a file which contains a list of values. For example, there is a file filter.txt which contains the following values.

b
f

# $1 means the first column in the file
filterx data.csv -H "b in 'filter.txt$1'"

Column choosing

if one csv file has header, you can directly use the column name to filter the file. If the file does not have a header, you can use the column index to filter the file. For example, the following command filters the file by the first column.

# col(1) means the first column
filterx data.csv "col(1) > 1"

Create a new column

filterx data.csv -H "alias('new') = 1;alias('fk') = a + 5"

The output is

a,b,c,new,fk
1,"a","apple",1,6
2,"b","banana",1,7
3,"b","dog",1,8
4,"b","cat",1,9
5,"e","elephant",1,10
6,"f","fish",1,11
7,"g","goat",1,12
8,"h","horse",1,13
9,"i","iguana",1,14
10,"j","jaguar",1,15

Select columns to output

filterx data.csv -H "a > 1 and b == 'b';select(a)"

The output is

It also can used to control the order of columns.

filterx data.csv -H "a > 1 and b == 'b';select(c, a)"

The output is

c,a
"banana",2
"dog",3
"cat",4

Drop columns

filterx data.csv -H "a > 1 and b == 'b';drop(a)"

The output is

b,c
"b","banana"
"b","dog"
"b","cat"

Expression Language

The expression language is a subset of python language. It supports the following operators and functions.

Operators

==: equal
!=: not equal
>: greater than
>=: greater than or equal
<: less than
<=: less than or equal
and: logical and
or: logical or
in: in a list/file
not in: not in a list/file

Functions

col(n): get the value of the n-th column
alias(c): create column with name c
select(c, ...): select column with name c to output, support multiple columns
drop(c, ...): drop column with name c, support multiple columns

Future

I am a graduate student studying bioinformatics and I need to filter large csv files regularly. I will add more features to this tool. If you have any suggestions, please let me know.

gtf tab gff vcf csv • 1.7k views

ADD COMMENT • link updated 8 months ago by Matthias Zepper 5.1k • written 8 months ago by dwpeng ▴ 120

4

Entering edit mode

Thanks for making the effort of writing such a tool!

I can see no harm of having multiple tools doing sort of the same thing...I mean Novoalign was nice, but I am happy that people didn't stop innovating there ;-) Building it on top of Polars is a clever choice to get memory-efficient and fast processing of the files without reinventing too many wheels.

A feature I suggest adding is an interactive preview mode. As others have pointed out, there are numerous tools that can be used on tabular data for filtering and reformatting. But I do not know any that would allow to iteratively and interactively build your expressions. You will always have to run the command and pipe it to head or something like that to see if it actually did what you wanted.

If filterx had the option to be launched on a file, display a text prompt to enter the expression and while typing already show a preview of the first ...say... 5 rows of the result, it would have a nice UX. Once you are happy, hit ENTER and the whole file is processed. You could probably reuse some code of TidyViewer for a nicely rendered preview.

PS: If you are looking to automatically build filterx also for other platforms...feel free to grab the Github Actions CI pipeline from umi-transfer.

ADD REPLY • link 8 months ago by Matthias Zepper 5.1k

2

Entering edit mode

Thank you so much for your encouragement. To be honest, I haven't received such positive feedback in a long time.

As you mentioned above, the interactive mode helps the user get the result of the expression quickly, which is really helpful. However, there are some other problems, such as filtering based on the average value of a column (currently not supported by filterx), and calculating the average value requires reading the entire csv file into memory and then calculating it. This could introduce new efficiency problems. Over the next few days, I will try to add an interactive mode. Another good UX interface, such as the use of simple TUI implementation, just like the C language in the GDB debugging tool, can greatly save the user's mental burden.

Thank you for providing me with a github action. In fact, I also used the cross tool in another project, but I was worried that using MUS-libc would affect the efficiency of the program, so I chose zig-build. However, there are compilation issues on windows and mac, and I may use cross instead of zig-build for the next version to provide a binary version of linux/windows/macos.

ADD REPLY • link 8 months ago by dwpeng ▴ 120

0

Entering edit mode

You are more than welcome! It is a well deserved encouragement, because developing (and later maintaining) an open-source tool is tedious, and I am grateful for each of them, since I rely on such tools every day. Furthermore, as far as my limited Rust knowledge allows me to judge, your code is well-structured and cleanly written, so it is also a good codebase.

No rush with implementing the preview mode, but yes, making use of some TUI features could really set filterx apart from other tools. In see the issue with efficiency problems as well and would address those either by making it optional (subcommand like filterx interactive or a flag like filterx -i) or disable certain functionality.

For example, filtering entire columns on computed summarized values to me seems to an edge case, because most tabular data is inherently row-oriented? The information in a particular column may not be interesting for a particular purpose, but then I will just cut it entirely based on column name or number. I think, for most practical filtering purposes, it would be enough to process one row at a time, or just a few rows in a sliding-window manner.

But that is actually a subject a forum like this could easily help you figuring out. You could ask the people what kind of task they repeatedly struggle with doing with tools like awk and then think about how this operation could be simplified using filterx instead.

ADD REPLY • link 8 months ago by Matthias Zepper 5.1k

3

Entering edit mode

I would first of all check that the functionality is novel and not redundant to efficient and established tools like https://bioinf.shenwei.me/seqkit/usage/, samtools and tabix.

ADD REPLY • link 8 months ago by ATpoint 88k

6

Entering edit mode

I think you meant to refer to csvtk toolkit which is analogous to the tool being described here (also from Wei Shen author of seqkit): https://bioinf.shenwei.me/csvtk/

As long as a tool works as advertised having more options is not a bad thing, even though the functionality may be somewhat redundant..

ADD REPLY • link 8 months ago by GenoMax 151k

0

Entering edit mode

Thank you for your suggestion. My tool now is just a framework with basic component. I will try to add more function so that this tool can meet more filtering requirement. Like filter sequence by length or gc.

ADD REPLY • link 8 months ago by dwpeng ▴ 120

2

Entering edit mode

A lot of text-based bioinformatics files will store data with tabs or other delimiters than commas. It might be useful to provide options to let the user specify different field and record delimiters, as awk does.

ADD REPLY • link 8 months ago by Alex Reynolds 36k

1

Entering edit mode

yse, you are right. This tool support different delimiters, commas is default choice.

ADD REPLY • link 8 months ago by dwpeng ▴ 120

2

Entering edit mode

You may check DuckDB or xsv. One can do i.e.:

duckdb < some_filtering_code.sql

This is not to discourage you to continue with your project (xsv replacement able to handle TSVs could be handy).

ADD REPLY • link 8 months ago by Darked89 4.7k

1

Entering edit mode

Thanks your example with duckdb and xsv. I have noticed duckdb for a while, and I also used it. To be honest, duckdb is my one of my favorite tools that can process row-based data with just simple SQL expression. The reason I want to make new tool because those tools are designed to process a more standard files like sql/csv/tsv/json, not designed for bioinformatics. So I want to use this tools to implement a more friendly to bioinformatics.

ADD REPLY • link 8 months ago by dwpeng ▴ 120