Question

Tool: Efficiently process (view, analyze, clip ends, convert, demultiplex, dereplicate) SFF/FastQ files

22

11.0 years ago

BioApps ▴ 800

Hi. I am a biologist in need of a good graphical/visual/fast FastQ editor. Starting from a Biostars thread, I implemented my own SFF/FastQ editor a few days ago. I hope this is the most complete SFF/FastQ editor available. If you want a specific feature implemented, just let me know.

Features

Supported files

SFF, FastQ, FQ, Fasta (soon)

Filters

Cut reads with average QV under specified threshold
Cut reads if they contain N bases (the user can specify how many)
Cut low complexity reads
Cut reads that are too short
Cut reads that are too long
Cut low quality ends. Automatically detect and cut low quality bases at the end of each read
Cut poly-A/T tails
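
To make two of the filters above concrete, here is a minimal Python sketch of an average-QV filter and a 3' end quality trimmer. The function names, the default threshold of 20, and the Phred+33 offset are assumptions for illustration; the program's own trimming algorithm is not described in this post and may work differently.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FastQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def passes_mean_quality(quality: str, min_mean: float, offset: int = 33) -> bool:
    """Keep a read only if its average quality reaches the threshold."""
    scores = phred_scores(quality, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_mean


def trim_low_quality_end(seq: str, quality: str, min_q: int = 20,
                         offset: int = 33) -> tuple[str, str]:
    """Drop bases from the 3' end until one reaches min_q (assumed rule)."""
    end = len(seq)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], quality[:end]
```

For example, `trim_low_quality_end("ACGT", "II##")` keeps only the first two bases, because `#` decodes to Q2 under Phred+33.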

Tools and converters

Dereplicate sequences (to be released soon!)
Split multiplexed files (MID/barcode splitter)
Remove contaminants (search overrepresented sequences against a contaminant database)
File splitter: Split huge FastQ/SFF file in chunks of x reads
File splitter: Cut all sequences in the specified range
Compact FastQ files
Convert SFF to FastQ
Convert SFF to Fasta
Convert FastQ to Fasta (multiFasta)
Convert FastQ file to a different encoding (under development)
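
As an illustration of the simplest converter in this list, FastQ to Fasta, here is a hedged Python sketch. It assumes four-line FastQ records (no wrapped sequences) and is not the program's own code; `read_fastq` and `fastq_to_fasta` are hypothetical names.

```python
import io
from typing import Iterator, TextIO


def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def fastq_to_fasta(src: TextIO, dst: TextIO) -> int:
    """Write every record as Fasta; return the number of reads converted."""
    count = 0
    for name, seq, _qual in read_fastq(src):
        dst.write(f">{name}\n{seq}\n")
        count += 1
    return count
```

Because records are handled one at a time, memory use does not grow with the size of the input file.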

Graphs and data analysis

Sequence viewer - Show all reads: Read name, Base sequence, average quality, sequence length
Sequence length distribution graph
Per base sequence quality graph
Per base GC content graph
Per base sequence content graph
Per base N content graph (integrated in the 'Per Base Content' graph)
Per sequence quality scores graph

Graphs can be expanded to full screen
All graphs are updated in real time as the file is processed

Download link

Version 3.2.3 (released August 2015) can be downloaded here. The size of this program is about 4 MB. No installer needed.

Dereplication is now also available (app). Statistic data about clusters included in Dereplicator.
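
For readers unfamiliar with the term, dereplication collapses identical reads into one entry with a count. The Python sketch below shows the basic idea under the assumption of exact, case-insensitive matching; the Dereplicator's actual clustering rules are not documented in this post.

```python
from collections import Counter
from typing import Iterable


def dereplicate(sequences: Iterable[str]) -> list[tuple[str, int]]:
    """Collapse identical sequences; return (sequence, copies), most abundant first."""
    counts = Counter(seq.upper() for seq in sequences)
    return counts.most_common()
```

Unlike the streaming filters, this step has to remember every distinct sequence it has seen, so its memory use grows with the diversity of the file.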

'Follow' this post to stay up to date.

Requirements:

<3MB of disk space
no installation
no Java
no .Net
no admin permissions
no money :)

Speed & mem footprint:

On an old Toshiba laptop (i5, 2.2GHz) it loads a 0.5GB file in under 11 sec (if no processing is applied). This also includes the time for determining the file encoding (Solexa, Illumina, Sanger). The memory footprint should not exceed 15-30MB. I am thinking about doing the file decoding and the data processing in separate threads.
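
Encoding detection of this kind is usually a heuristic on the range of quality characters. The Python sketch below shows one common approach; the cut-off values are the conventional ASCII ranges for these encodings, the labels and function name are assumptions, and this is not necessarily how the program decides.

```python
from typing import Iterable


def guess_encoding(quality_lines: Iterable[str]) -> str:
    """Guess the FastQ quality encoding from the observed character range."""
    lowest, highest = None, None
    for line in quality_lines:
        for ch in line:
            code = ord(ch)
            lowest = code if lowest is None else min(lowest, code)
            highest = code if highest is None else max(highest, code)
    if lowest is None:
        return "unknown"
    if lowest < 59:        # characters below ';' only occur with offset 33
        return "Sanger (Phred+33)"
    if lowest < 64:        # ';' to '?' are negative Solexa scores
        return "Solexa (Solexa+64)"
    if highest > 74:       # above 'J' is unusual for offset 33
        return "Illumina 1.3+ (Phred+64)"
    return "ambiguous"     # only high scores seen; both offsets fit
```

A file that contains only very high quality scores can fit more than one encoding, which is why the sketch returns "ambiguous" rather than guessing.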

Your feedback

The program was built on feedback from users. So, please comment on things such as:

Feature requests
Platform you are interested in (Windows, Mac, Linux) - This is very important!
Statistics about your files (file type, how many, file size) and your working station (CPU/RAM)
Which of the existing modules you are interested in (so we can improve them)
New request from users: Allow program resizing so it can fit on very small laptop screens

This tool integrates with Avalanche Workbench.

sff sequence fastq next-gen • 25k views

ADD COMMENT • link updated 21 months ago by Ram 45k • written 11.0 years ago by BioApps ▴ 800

9

Linux, Linux, Linux and Linux. Without Linux support, you are excluding >90% of potential users.

ADD REPLY • link 10.6 years ago by lh3 33k

4

And OS X. Biased sample, maybe, but I just don't see too many folks here with Windows laptops doing informatics work. It's all Linux and OS X for real work.

ADD REPLY • link 10.6 years ago by Alex Reynolds 36k

0

I agree. There are lots of Mac users in the biology field. The port for Linux/Mac is scheduled.

ADD REPLY • link 10.6 years ago by BioApps ▴ 800

0

Until the Linux port is available (I promise it will be), the program can be used under Linux via Wine.

ADD REPLY • link 10.6 years ago by BioApps ▴ 800

2

Wine is rarely used in bioinfo. For your next project, please take linux/mac as a prerequisite, not an afterthought. Thank you.

ADD REPLY • link 10.6 years ago by lh3 33k

3

We often want to look at one file in a run, but we would almost never open all the files in a sequencing run.

Usually they share many characteristics. Your software should have the option of running as a command line tool as well.

ADD REPLY • link 11.0 years ago by Istvan Albert 102k

1

command line

You mean to access the tools via the command line?

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by BioApps ▴ 800

0

Yes. Like fastqc, the program should run from the command line if only some non-graphical functionality is needed.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by Istvan Albert 102k

1

I forgot to mention that it requires that much time only when you open a file for the first time. Opening the file subsequently requires below 1 sec.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by BioApps ▴ 800

0

it should never need to have the file in memory, except for dereplicate so it should be able to handle files of any size except for that function, correct?

ADD REPLY • link 11.0 years ago by brentp 24k

1

Yes. As you can see in the screenshot the program needs only 38MB for showing a 500MB file.

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

0

Cool. Then why this limitation: "On a modest computer (with 3GB RAM) the program should theoretically open files up to 40GB"?

ADD REPLY • link 11.0 years ago by brentp 24k

0

Well, the index is loaded in memory. The more sequences you have, the larger the index. Some calculations show that it should parse a file with up to 375 million sequences, which is equivalent to an 80GB file IF the sequences are about 100 bases each (40GB is for 200 bases/sequence).

Obviously, on a computer with more RAM you could open even larger files. But for the moment the program is 32 bit. The 64 bit version should be ready soon. Then the Linux and Mac versions.

Now I am trying to integrate SFF into the same GUI.

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

2

OK, but why would you need to load the entire index into memory? After all, the user will not actually scroll through hundreds of millions of reads. This is a common flaw, often seen in text editors, where opening a large file loads all of it into memory even though a person only edits or looks at one page at a time.

ADD REPLY • link 11.0 years ago by Istvan Albert 102k

2

Update. Now the program will take 15MB of RAM no matter how large the file is. An update will be available for download in the next few days.

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

0

If you need to perform an operation like viewing, getting the average quality, sorting, etc., you will need to parse all samples. Therefore, you need the index. The index is probably needed for any operation that applies to all samples. Please let me know if you have a different approach. I will try to incorporate it if it results in a smaller memory footprint.

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

1

I think one only needs an index to access random parts of the file at high speed. Reading all the records in an entire file to compute certain values should not need indices.

In your case the only use of an index would be to jump to a read of a given name, this is almost never needed in practice.

I would suggest making your tool exceedingly fast and having it use trivial amounts of memory (I mean megabytes) regardless of the file size. This means a very fast parser and streaming the view rather than preloading it. Now that would be a tool that would set itself apart.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by Istvan Albert 102k
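
The streaming approach suggested in the comment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions (four-line records, Phred+33 qualities, a hypothetical `stream_stats` name), not code from the tool or from any commenter. It keeps three running totals, so memory use stays constant regardless of file size.

```python
import io
from typing import TextIO


def stream_stats(handle: TextIO, offset: int = 33) -> dict:
    """Compute whole-file statistics in one pass without storing any reads."""
    reads = bases = qual_sum = 0
    for line_no, line in enumerate(handle):
        line = line.rstrip("\n")
        if line_no % 4 == 1:      # second line of each record: the sequence
            reads += 1
            bases += len(line)
        elif line_no % 4 == 3:    # fourth line of each record: the qualities
            qual_sum += sum(ord(ch) - offset for ch in line)
    return {
        "reads": reads,
        "mean_length": bases / reads if reads else 0.0,
        "mean_quality": qual_sum / bases if bases else 0.0,
    }
```

No index is built at any point; each line is read, folded into the totals, and discarded.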

0

Hi Albert!

The technical reasons for keeping the index in memory:

I don't read the file in text mode (line by line). Instead I read it in binary mode because I have a library that buffers the I/O operations. This results in better performance.
The viewer (lets the user scroll and see all sequences).
This is not only a viewer. I intend to add all kinds of tools that will need random access to sequences (that is the main reason).
RAM is cheap. On a computer with 6-8GB RAM (which is quite common today, especially if you are a biologist working with large files :) ) the user will be able to open files of around 160-210GB. I am not sure what the largest FastQ file ever created is, but I think 160-210GB is a nice range.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by BioApps ▴ 800

2

Well, I think you should be careful with this, find yourself a collaborator, and solve their problems. Many things that make sense in theory do not make any difference in practice and do not scale for real work. Good software needs to solve real pain.

It is very rare, for example, that I would go back to a fastq file after checking it out once, so the time it saves me on the second opening is pointless. I just want to look at it once and verify that things look right. For that, I would hate to sit for minutes staring at a GUI while memory is consumed when I can do it faster by other means: fastqc will run in the background and I can limit the memory it uses. Moreover, it is very likely that the project will have dozens of fastq files associated with it, and I would hate having to click, open, and wait for each one.

So you see how your tool would give me very little reason to use it.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by Istvan Albert 102k

1

I also have to agree. Normally, nobody would ever look at the fastq-file. Overall statistics are important here (take a look at fastqc). The sequences themselves are not really interesting. They are far too many, anyway. How long can you scroll down your list? ;)

I would recommend you: Talk to biologists or bioinformaticians and ask them, what they really need!

What I would like/need:

very fast adapter prediction and clipping algorithm (for my 16GB fastq files the available ones are very slow, so I skip it)
fast bam statistics
- percentage of mapped reads, mapped mates, unmapped mates (multiple mappings should be counted once)
- percentage +/- strand mappings
- library complexity
- multiple mappings statistics
- DNA: mean + median + stdev of coverage
nice visualization tool for fusion transcripts

These are just three of many many more.... :)

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by David Langenberger 11k

0

(I agree with Istvan) The reason for an index is random access. The things you are proposing, e.g. viewing parts of a file or doing operations on the entire file do not require an index. For example, if they want to scroll backwards, you can seek().

ADD REPLY • link 11.0 years ago by brentp 24k
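
One way to combine seek() with scrolling is to remember a byte offset only every few thousand records and re-read forward from the nearest one. The Python sketch below is an assumption of how that could look, not something spelled out in the comment above; the function names and the checkpoint spacing are hypothetical, and four-line records are assumed.

```python
import io
from typing import BinaryIO


def build_checkpoints(handle: BinaryIO, every: int = 1000) -> list[int]:
    """Record the byte offset of every `every`-th record (4 lines each)."""
    handle.seek(0)
    checkpoints = []
    record = 0
    while True:
        pos = handle.tell()
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return checkpoints  # reached end of file
        if record % every == 0:
            checkpoints.append(pos)
        record += 1


def read_page(handle: BinaryIO, checkpoints: list[int], first: int,
              count: int, every: int = 1000) -> list[str]:
    """Return up to `count` sequences starting at record index `first`."""
    handle.seek(checkpoints[first // every])
    for _skip in range((first % every) * 4):   # skip whole records line by line
        handle.readline()
    page = []
    for _rec in range(count):
        if not handle.readline():              # header line; empty means EOF
            break
        page.append(handle.readline().strip().decode())
        handle.readline()                      # '+' line
        handle.readline()                      # quality line
    return page
```

With one offset per 1000 records, 375 million reads need only 375,000 stored integers, and scrolling backwards costs at most 999 skipped records.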

0

I edited the original post to show that this is an editor rather than a simple viewer (true, for the moment only one editing function is ready :) )

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

0

OK guys. Thanks for the feedback. I heard your voices. I will get rid of the index :)

Please keep sending good feedback!

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

0

Limitation removed!

ADD REPLY • link 10.6 years ago by BioApps ▴ 800

0

I would need it to view gzipped files. Our fastq are stored as gzip and dumping them out is a lot of IO hassle. Then let us know when the "linux port" is ready, because the windows machines are kept far away from the valuable data (more IO hassles).

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 11.0 years ago by karl.stamm 4.1k

0

Hi Karl. Do you mean that you want to work on your packed FastQ file without unpacking the whole file (this may be quite difficult), or do you want the program to quietly unpack the file in the background and work on that temporary file? But, yes, I see how this may be a nice feature. Most Linux users probably have their files packed this way.

PS: One possible solution to your packing problem (though it works only on Windows) would be not to pack your files with Zip but to use the default packing tool offered by NTFS ("Compress content to save disk space"). An 86MB file gets compressed to 43MB using NTFS and to 25MB using a zip algorithm, which is not bad if you consider that 'unpacking' the file is instantaneous.

ADD REPLY • link 11.0 years ago by BioApps ▴ 800

0

Have a look at zlib. Reading gzip files is very easy.

ADD REPLY • link 10.9 years ago by lh3 33k
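
To show how little is involved in reading gzipped FastQ as a stream, here is a short sketch using Python's standard `gzip` module (which is built on zlib). The tool itself is written in Delphi, so this is only an analogy; `count_reads` is a hypothetical helper and four-line records are assumed.

```python
import gzip


def count_reads(path: str) -> int:
    """Count FastQ records in a plain or gzip-compressed file, streaming."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:   # "rt" decompresses to text on the fly
        return sum(1 for _line in handle) // 4
```

The file is decompressed as it is read, so no temporary unpacked copy is written to disk.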

0

Thanks. I was happy to see there is a port for Delphi also. I will take a look.

ADD REPLY • link 10.9 years ago by BioApps ▴ 800

0

Adding support for SFF

ADD REPLY • link 10.9 years ago by BioApps ▴ 800

Ram · Answer 1 · 2014-05-29

1

10.9 years ago

BioApps ▴ 800

Release history

v2.0 / August 2014

Redesigned GUI
Save reports in HTML format
New tool: Demultiplexing - Split based on FastQ internal info (only for Illumina files) - Load info about adaptor clipping from sequence name (comment) line
New tool: Demultiplexing - Split multiplex file based on barcode sequence(s) provided by user. Reverse complement sequences are also supported.
New tool: Demultiplexing - Split based on FastQ internal info radio box is automatically disabled if the sequence does not contain the info
New tool: File splitter: Split huge FastQ/SFF file in chunks of x reads
New tool: File splitter: Cut all sequences in the specified range
New tool: Revamped file converter. Converts between Fasta, FastQ, SFF
New tool: Remove overrepresented sequences.
New tool: Remove contaminants. Search overrepresented sequences against a contaminant database (allow user to add/remove seq from database)
New function in Adaptor Trimming: Cut x bases at 3' / 5' end
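
As an illustration of barcode-based demultiplexing with reverse-complement support, here is a minimal Python sketch. It assumes exact matches, a forward barcode at the start of the read, and a reverse-complemented barcode at the end; the matching rules the program actually uses are not documented here, and the names are hypothetical.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def assign_sample(seq: str, barcodes: dict[str, str]) -> tuple[Optional[str], str]:
    """Return (sample name, read with barcode removed), or (None, read)."""
    for sample, barcode in barcodes.items():
        if seq.startswith(barcode):
            return sample, seq[len(barcode):]
        rc = reverse_complement(barcode)
        if seq.endswith(rc):
            return sample, seq[:-len(rc)]
    return None, seq
```

Each read would then be appended to the output file of its sample, with unassigned reads going to a separate file.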

v1.9 / June 2014

New report: Sequence duplication level
New report: Overrepresented sequences

v1.7

Massive SFF/FastQ parsing speed optimization using buffered files
Important speed optimization when using the 'Refresh button'
The program is a bit more responsive when processing large files
Silently cut samples that have 0 good bases
Cut reads with GC under 15% or over 85%
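
The GC filter in the last item can be sketched as follows. The 15% and 85% bounds come from the release note; excluding N bases from the denominator and the function names are assumptions of this sketch.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def keep_by_gc(seq: str, low: float = 0.15, high: float = 0.85) -> bool:
    """Keep a read only if its GC fraction lies within [low, high]."""
    return low <= gc_fraction(seq) <= high
```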

v1.5

Added SFF support (processing, statistics, etc)
Tools - File splitter. Split huge FastQ/SFF file in chunks of x reads
Tools - Compact FastQ files (remove duplicate content of the + line)
Tools - Convert SFF to FastQ
Tools - Convert SFF to Fasta
Tools - Convert FastQ to Fasta (multiFasta)
Graph - All graphs are updated in real time (as filters are applied)
Graph - Sequence length distribution graph
Graph - Per base sequence quality graph
Graph - Per base GC content
Graph - Per sequence GC content
Graph - Per base sequence content
Graph - Per base N content (integrated in the 'Per Base Content' graph)
Graph - Show the 'Per sequence GC content' graphs as dots instead of lines
Graph - Resize graphs automatically
Graph - Remember height of each graph panels
Graph - Remember status of each graph panel (collapsed/expanded)
Graph - Let user scroll graphs using mouse scroll
Graph - Button to expand some graphs. Support for all graphs will be added soon
Graph - Added vertical scroll bar in graph's panel so the user can make any graph as long as he wants

v1.2

Tools - Trim poly-A/T tails
Tools - Cut reads with average QV under specified threshold
Tools - Cut reads if they contain N bases (the user can specify how many)
Tools - Cut reads longer than x bases
Tools - Ask where to save the file (at conversion)
Tools - Cut low complexity reads
Tools - Trim low quality ends. Automatically detect and cut low quality bases at the end of each read. Three parameters are used by this function.
Tools - Cut reads shorter than x bases.
Tools - Save the filtered file to disk (use the 'Refresh graph and save...' button).
Tools - Encoding auto detection was checked and works correctly.
Graph - Let user choose row height
Graph - Show all reads (no matter how many they are). It can show: Read name, Base sequence, average quality, sequence length, mini chromatogram.
Graph - Per sequence quality scores graph

Download link.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by BioApps ▴ 800

1

your tool needs a name

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Istvan Albert 102k

0

I know !!!!!!!!!! :)

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by BioApps ▴ 800

1

Next thing to come: FastQ speed improvement!

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by BioApps ▴ 800

1

Version 2.0 was ready a while ago, but I didn't have the time to test it properly, so I left for an 'important business' (read as 'holiday') before having the chance to publish the program. Sorry. I will release v2 soon.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.6 years ago by BioApps ▴ 800

0

Hi,

It seems the Mac and Linux download link is not working. Do you know why, or is the tool available only for Windows?

Thanks,

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.8 years ago by GP ▴ 10

0

Hi Gmax.

The program has not YET been ported to Mac (or Linux). I intend to add a few more features, some GUI improvements, bug fixes and lots of testing. Once I have a final-final version I will port it to Mac and later to Linux.

The ETA for v2.0 is ~7 days. Once we are there I will start to port it. For the moment the program should run without problems on Mac via CrossOver and on Linux via CrossOver or Wine.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.8 years ago by BioApps ▴ 800

1

Ok, Thanks very much!

G

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.8 years ago by GP ▴ 10

0

I must say this is starting to look very very impressive, so here are my two cents:

As others have said, you need Linux compatibility; some people don't take Wine seriously.

Next thing, command line. I see you have the option for HTML reports. It would be great to have something like "./ngs-wrokbench myfile.fastq -o my-file-dataset1", which would create an HTML report of everything without the user needing to specify 1000 different options. The point of the report is for the tool to guess as much as it can from the given dataset and tell the user all of its findings:

Did I find several multiplex barcodes? How many reads in each library?
Does my data look like paired end?
What common adapters was I able to find in the dataset, are they represented significantly? Can they tell me something about the data? Maybe I found a Nextera paired end adapter, maybe I found a mate-pair adapter?
k-mer graph, tells us about expected genome size, frequency of polymorphisms, presence of contaminants.

Lastly, source code... this is up to you, but giving people the option to compile it themselves will give you lots of credibility because it is open source. This way it can work even on cygwin for windows systems.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.6 years ago by Adrian Pelin ★ 2.6k
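
The k-mer graph requested above starts from a k-mer count table and the histogram of those counts (the k-mer spectrum). Here is a minimal Python sketch, assuming an in-memory counter, k = 21 by default, and that k-mers containing N are skipped; these choices and the function names are illustrative, not part of the tool.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int = 21) -> Counter:
    """Count every k-mer that contains no N across all reads."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts: Counter) -> Counter:
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())
```

Plotting the spectrum gives the graph in question. An in-memory table like this one is only practical for small datasets, since the number of distinct k-mers grows quickly with genome size and error rate.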

0

Buna Adrian.

Thanks a lot for your suggestions. They are very good, and I will implement all of them. The k-mer graph was already on the ToDo list; I just need the time to implement it.

As I already said, there will definitely be a port for the Mac/Linux platforms, but first I want to finish this (and several other) tools. Then I will start the porting. Until then, I am sure that the scientists who REALLY need to use my tools (if they really really want them) can use Wine/CrossOver/etc. I don't think their pride will be hurt that much. The final purpose of bioinformatics is the 'bio' part... finding the answers to biology-related questions. The tools (the programs, the OS, the emulators) are just...well... tools. Biologists will understand that.

Regarding the source code, unfortunately this will never be available. I got permission to use some bioinformatics libraries that are closed source. In the Windows and Mac world this is not a problem at all, since most programs are not open source (most programs are not even free). But Biostars is a Linux-biased community, so it is normal for people here to ask for the source code. Since I released the first version, many biologists have contacted me with platform-related questions, but none asked for the source code. Even if I distributed it, they probably wouldn't know what to do with it :) They just want a 'double-click and run' tool.

Thanks again for your precious feedback.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.6 years ago by BioApps ▴ 800

1

Regarding releasing the source code: Would it not be possible to dynamically link against the closed source libraries you are using so they can be distributed in binary while the code of your tool is free?

I am looking forward to giving your tool a try when you finish the Linux version.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.3 years ago by lelle ▴ 830

0

A number of open-source media players and transcoders do dynamic linking, given codecs that are closed source or are not able to be redistributed under the open source licensing terms.

ADD REPLY • link 10.3 years ago by Alex Reynolds 36k

0

I don't see why we could not do that :)

Are you interested in a specific module? Maybe I can write a special function that will do exactly what you need.

Or maybe you could start the program with the GUI hidden and pass some parameters in the command line. The program will process the file and exit silently.

Anyway, if you need something specific just let me know.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.3 years ago by BioApps ▴ 800

0

I will also look into plugins. I have never done this but it doesn't seem so complicated.