How to convert raw Nanopore R9 fast5 files to fastq files?
8 months ago
Lélé ▴ 10

Hello,

I am new to processing Nanopore sequencing data and have run into an issue:

I have binary fast5 files straight from the Nanopore sequencer (R9 flow cell) and would like to basecall them with Dorado, as it seems to be the preferred tool. I have already used Dorado for simplex basecalling on pod5 files, passing "hac" as the model argument as described on the nanoporetech/dorado GitHub page.

However, I can't seem to make it work on fast5 files even though the documentation says that it is supported for simplex basecalling (though less performant). This is what I type in my terminal:

$ dorado basecaller hac /directory/to/my/fast5/files --emit-fastq > output.fastq 

I keep getting this error:

[error] Cannot automate model selection using fast5 files

I have tried using "fast" or "sup" instead of "hac" in case it would make a difference but to no avail.

Is there a specific model I should use or download? Are there any other tools you could recommend for basecalling from fast5 files? I know about Guppy; however, I am unable to download it, as it is an ONT tool.

Any help would be greatly appreciated.

Thanks,

Lele

dorado fast5 basecalling nanopore • 2.8k views
8 months ago

According to the documentation:

Dorado can automatically select a basecalling model using a selection of model speed (fast, hac, sup) and the pod5 data. This feature is not supported for fast5 data. If the model does not exist locally, dorado will automatically download the model and delete it when finished. To re-use downloaded models, manually download models using dorado download.

So in your case, try downloading the model locally first using

dorado download --model all

then run the basecalling:

dorado basecaller hac@latest /directory/to/my/fast5/files --emit-fastq > output.fastq

Thank you for your reply. I have just tried this, as well as using hac@v4.2.0, but I still get the same error. I guess I'll have to figure out another way.


Have you tried specifying the exact model? For instance, dna_r9.4.1_e8_hac@v3.3 for an R9 flow cell.
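
A minimal sketch of that suggestion, under a few assumptions: a POSIX shell, that dorado download --model <name> places the model in a folder of the same name in the current directory, and that dorado basecaller accepts that folder path as its model argument. The model name and read directory are the ones quoted in this thread; basecall_named_model and the DRY_RUN switch are hypothetical helpers for illustration. By default the function only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: basecall fast5 reads with an explicitly named model (no automatic selection).
set -eu

# basecall_named_model MODEL READS_DIR OUT_FASTQ
# Hypothetical helper. With DRY_RUN=1 (the default) it only prints the commands.
basecall_named_model() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "dorado download --model $1"
        echo "dorado basecaller ./$1 $2 --emit-fastq > $3"
    else
        # Assumption: the model is downloaded into ./MODEL in the current directory.
        dorado download --model "$1"
        # Passing the model folder as a path avoids automatic model selection.
        dorado basecaller "./$1" "$2" --emit-fastq > "$3"
    fi
}

basecall_named_model "dna_r9.4.1_e8_hac@v3.3" "/directory/to/my/fast5/files" "output.fastq"
```

The model is passed as a path here because how a bare model name is parsed appears to depend on the Dorado version, so it is worth checking against the version you have installed.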


I just replaced hac with the full name of one of the downloaded models and it works. Silly mistake on my part: hac@latest still selects a model automatically, as stated in the model complex table.

Thanks again!


Would that have changed lately?

I was chatting with one of the Dorado developers recently (in the context of a bug report), and he told me that the model name should start at 'hac', 'fast' or 'sup', and that all the text before it should be omitted. Otherwise Dorado assumes you are providing a path to a model and will fail. (Alternatively, you can provide the full path to the model, in which case the dna_..._ part of the name is required.)

They acknowledge themselves it is indeed a bit confusing ;)

(This applies to Dorado from at least v0.8 onwards.)


If you have internet access, Dorado will automatically download the correct model. You only need to specify the model speed (hac, sup, etc.).


That indeed works as well, but naming the model explicitly is for the case where you (for some reason) want to use a specific model (and/or a specific version of it).


Interesting point though: would you assume that most (all?) users always use the most recent model? If so, then the system where you just ask for 'hac', 'sup' or 'fast' is likely the easiest (and would make keeping things up to date a lot smoother :) ).

If you are running it on HPC systems without internet access, it's a different ballgame of course.


The main point is that you want to make sure you use the model for the right pore version. That seems to happen automagically if you have internet access.


True indeed, but if you pre-download all models, Dorado will also do that without internet access.

Well, it will always do that without internet access (the information is in the pod5/fast5 files themselves), but downloading the models without internet access, if you don't have them already, will be difficult of course :)
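
For the offline case, one option is to check that a pre-downloaded model folder is present before starting a job. A small sketch, assuming the models were downloaded beforehand (for example with dorado download on a machine with internet access) into a shared directory with one folder per model. find_local_model and the directory layout are hypothetical, and the temporary directory below only stands in for a real models directory so that the example runs anywhere.

```shell
#!/bin/sh
# Sketch: resolve a pre-downloaded model folder on a machine without internet access.
set -eu

# find_local_model MODELS_DIR MODEL_NAME
# Prints the model path if the folder exists, otherwise reports what is missing.
find_local_model() {
    if [ -d "$1/$2" ]; then
        echo "$1/$2"
    else
        echo "model '$2' not found in '$1'; download it first where internet is available" >&2
        return 1
    fi
}

# Stand-in for a real models directory (hypothetical layout: one folder per model).
models_dir="$(mktemp -d)"
mkdir "$models_dir/dna_r9.4.1_e8_hac@v3.3"

model_path="$(find_local_model "$models_dir" "dna_r9.4.1_e8_hac@v3.3")"
echo "would run: dorado basecaller $model_path /directory/to/my/fast5/files --emit-fastq"

rm -rf "$models_dir"
```

Failing early with a clear message is the design choice here: on a compute node without internet access, a missing model would otherwise only surface when Dorado tries to download it.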

8 months ago
Dave Carlson ★ 1.9k

Your best bet is probably to convert your fast5 files to pod5. This can be accomplished with one of the pod5 tools:

https://pod5-file-format.readthedocs.io/en/latest/docs/tools.html#pod5-convert-fast5
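
A minimal sketch of the conversion followed by basecalling, assuming the pod5 command-line tools are installed (for example via pip install pod5) and that all fast5 files sit directly in one directory. The pod5 convert fast5 ... --output form follows the pod5 tools documentation linked above; fast5_to_fastq, the pod5 output folder name and the DRY_RUN switch are hypothetical helpers for illustration. By default the function only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: convert fast5 to pod5, then basecall the pod5 with automatic model selection.
set -eu

# fast5_to_fastq FAST5_DIR POD5_DIR OUT_FASTQ
# Hypothetical helper. With DRY_RUN=1 (the default) it only prints the commands.
fast5_to_fastq() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "pod5 convert fast5 $1/*.fast5 --output $2/converted.pod5"
        echo "dorado basecaller hac $2 --emit-fastq > $3"
    else
        mkdir -p "$2"
        # Merge all fast5 files into a single pod5 file.
        pod5 convert fast5 "$1"/*.fast5 --output "$2/converted.pod5"
        # The model speed "hac" can be used here because the input is now pod5.
        dorado basecaller hac "$2" --emit-fastq > "$3"
    fi
}

fast5_to_fastq "/directory/to/my/fast5/files" "pod5" "output.fastq"
```

If automatic selection does not find a suitable R9 model with your Dorado version, an explicitly named model can be passed in place of hac.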


Thank you for your help. However, I was hoping to compare both techniques: the R9 flow cell data, which is in fast5 format, and the R10 data, which is in pod5. Wouldn't converting fast5 to pod5 bias the comparison?


I don't think so. POD5 is a file format (like fast5), and it is much more efficient with Dorado. I don't recall the exact number, but there was a several-fold speed-up when rebasecalling on a GPU.
