Question

Out of Sequencer Nanopore Raw Data

0

12 months ago

SomeOne ▴ 240

Hello Reader,

I hope you are in the best of health.

I am fairly new to Oxford Nanopore raw data. I have mostly seen it as a single FASTQ file per sample.

I recently received Nanopore sequence data for our project, and I see five files per sample, named as follows:

Isolate_1_fast5_fail.tar
Isolate_1_fast5_pass.tar
Isolate_1_fastQ_fail.tar
Isolate_1_fastQ_pass.tar
sequencing_summary_PAW77343_2a90c311_84f6a71c.txt

When I extract these files (e.g. with tar -xf Isolate_1_fast5_fail.tar), I get multiple files in each directory, as follows:

Isolate_1_fastQ_pass contains 269 *.fastq.gz files
Isolate_1_fastQ_fail contains 6 *.fastq.gz files
Isolate_1_fast5_pass contains 269 *.fast5 files
Isolate_1_fast5_fail contains 6 *.fast5 files

My intention is to perform de novo genome assembly. I know fast5 is the native output format for Nanopore.

Question 1: What does the notation Fail/Pass mean?

Question 2: How should I use the fastq files for downstream analysis? Should I zcat all the *.fastq.gz files into one fastq.gz file and use that as input to the assembler of choice?

Question 3: Which assembler is recommended for genome assembly of fungal Nanopore sequence data? I also have Illumina short-read sequence data for the same samples.

Your feedback is welcome.

Thanks.

fastq assembly fast5 genome nanopore • 1.9k views

ADD COMMENT • link updated 12 months ago by GenoMax 152k • written 12 months ago by SomeOne ▴ 240

1

To add to colindaven's answer.

Q1 -

What does the notation Fail/Pass mean?

Normally, reads that satisfy the filter criteria (qual >= 7.0 and length >= 0) are marked as passed.
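To make the pass/fail criterion concrete, here is a minimal Python sketch of how a read's mean quality could be checked against that threshold. It assumes standard Phred+33 FASTQ encoding and assumes the mean is taken over per-base error probabilities rather than over the Q values themselves, which is my understanding of how ONT reports mean read quality. The function names are made up for illustration and are not part of any ONT tool.

```python
import math


def mean_qscore(quality_string: str, phred_offset: int = 33) -> float:
    """Mean read quality, averaged in error-probability space (assumption)."""
    if not quality_string:
        raise ValueError("empty quality string")
    # Convert each Phred character to an error probability: p = 10^(-Q/10).
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    # Average the probabilities, then convert the mean back to a Q score.
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(quality_string: str, min_qscore: float = 7.0) -> bool:
    """True if the read would be labelled 'pass' under the given threshold."""
    return mean_qscore(quality_string) >= min_qscore
```

Note that averaging error probabilities gives a lower value than a plain arithmetic mean of the Q values whenever the base qualities are uneven, so a few poor bases pull a read down quickly.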

Q2 - You can use cat; zcat is not needed, because gzip files concatenated back to back still form a valid gzip file.
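If you prefer to do the merge from a script, the same byte-for-byte concatenation can be sketched in Python. This is only an illustration of why decompression is unnecessary; `concat_gzip` is a hypothetical helper name, and the file names in the usage comment are placeholders.

```python
import gzip
import shutil


def concat_gzip(parts, merged):
    """Append gzip files byte for byte, without decompressing them.

    The output is a multi-member gzip file, which gzip readers
    (including Python's gzip module) decompress as one stream.
    """
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Example usage (placeholder paths):
# from pathlib import Path
# parts = sorted(Path("Isolate_1_fastQ_pass").glob("*.fastq.gz"))
# concat_gzip(parts, "Isolate_1_pass.fastq.gz")
# with gzip.open("Isolate_1_pass.fastq.gz", "rt") as fh:
#     print(fh.readline())
```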

You can also run pycoQC on sequencing_summary_PAW77343_2a90c311_84f6a71c.txt. That gives you a nice graphical overview of the run.
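For a quick look at the summary file without pycoQC, a rough Python sketch like the following counts pass and fail reads. It assumes the file is tab-separated and has a `passes_filtering` column holding TRUE/FALSE, as in the summaries I have seen; check the header of your own file, since column names can differ between software versions. `count_pass_fail` is a hypothetical helper name.

```python
import csv
from collections import Counter


def count_pass_fail(handle, column="passes_filtering"):
    """Count reads flagged pass/fail in a tab-separated sequencing summary.

    `column` is an assumption about the header name; adjust it to
    match the header line of your own summary file.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        flag = row[column].strip().upper()
        counts["pass" if flag == "TRUE" else "fail"] += 1
    return counts


# Example usage:
# with open("sequencing_summary_PAW77343_2a90c311_84f6a71c.txt") as fh:
#     print(count_pass_fail(fh))
```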

ADD REPLY • link 12 months ago by GenoMax 152k

score 1 · Answer 1 · 2024-07-12

1

12 months ago

colindaven 7.7k

fast5 files contain the raw signal; the basecaller (dorado, or previously guppy) converts these into FASTQ.

Q1 - Failing and passing reads are lower- and higher-quality reads, respectively. Check the relevant quality distributions with a tool like pycoQC and/or, after alignment, with cramino to get a feel for it.
Q2 - Yes
Q3 - Try the Flye assembler.

ADD COMMENT • link 12 months ago by colindaven 7.7k

0

If my intentions are to generate the fastQ files again from fast5 files, should I keep all pass and fail fast5 files merged or basecall them separately?

Would it be worthwhile to redo the basecalling, given that I already have fastQ files provided by the sequencing company? I guess they already did some QC on the data during basecalling.

ADD REPLY • link 12 months ago by SomeOne ▴ 240

0

When was the basecalling done, and with which tool? Is it ONT 10.4.1 data or older 9.4 data? Was dorado used? Is the data modern Q20+? For Q20+ data I use a minimum Q of 17 these days.

It should be fine to proceed with the data as is for a first assembly; see if you are happy with that before re-basecalling.

You can use the tool fastp to check the current quality distribution and exclude very short or poor-quality reads.

If you really want to improve your read quality and have a big server, try dorado correct.

ADD REPLY • link 12 months ago by colindaven 7.7k

0

I received this data 2-3 days ago.

The data report mentions that dorado was used for the basecalling. I am not sure which specific parameters were used, but my guess is that they went with the default values. I have contacted the representative to ask about the specific parameters.

Based on the report, READ_MEAN_QUALITY is around 16 ± 0.2.

If you really want to improve your read quality and have a big server, try dorado correct.

Is dorado correct similar to the read correction that SPAdes does before assembly?

ADD REPLY • link 12 months ago by SomeOne ▴ 240

0

If my intentions are to generate the fastQ files again from fast5 files

You should convert fast5 files into a single POD5 format file, if you are planning to recall bases using dorado.

If the basecalls you received are already high accuracy (HAC), then they may be sufficient as is.

ADD REPLY • link 12 months ago by GenoMax 152k

0

You should convert fast5 files into a single POD5 format file, if you are planning to recall bases using dorado.

I have multiple fast5 files. Should I merge/concatenate them before converting to POD5 format?

If yes, should I also include the fast5_fail files?

That is, put all the fast5 data in one POD5 file, then do the basecalling again with parameters that I define?

ADD REPLY • link 12 months ago by SomeOne ▴ 240

0

I have multiple fast5 files. Should I merge/concatenate them before converting to POD5 format?

Use https://pod5-file-format.readthedocs.io/en/0.1.21/docs/tools.html#pod5-convert-fast5 or if you want a web interface https://pod5.nanoporetech.com/

should I also include the fast5_fail files?

That is up to you. If you do, keep in mind that dorado does not do any filtering by default; you will need to set a threshold using --min-qscore.
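As a rough illustration of the two steps, the Python sketch below only builds the command lines (it does not run anything), so you can inspect them first. It assumes the `pod5 convert fast5 ... --output` form from the pod5 documentation linked above and a recent dorado that accepts a model name such as `hac` as the first argument to `dorado basecaller`; verify both against `--help` for your installed versions. The helper names, directory names, and the Q-score value are placeholders.

```python
from pathlib import Path


def pod5_convert_command(fast5_dirs, output_pod5):
    """Build a `pod5 convert fast5` command covering all given directories.

    No manual merging of fast5 files is needed: every *.fast5 file is
    passed as an input and written into one POD5 output file.
    """
    fast5_files = sorted(
        str(path) for d in fast5_dirs for path in Path(d).glob("*.fast5")
    )
    return ["pod5", "convert", "fast5", *fast5_files, "--output", str(output_pod5)]


def dorado_command(model, pod5_path, min_qscore):
    """Build a `dorado basecaller` command with an explicit quality filter.

    The model name and threshold are assumptions to adapt to your data;
    dorado writes its basecalls to standard output.
    """
    return [
        "dorado", "basecaller", model, str(pod5_path),
        "--min-qscore", str(min_qscore),
    ]


# Example usage (placeholder values):
# dirs = ["Isolate_1_fast5_pass", "Isolate_1_fast5_fail"]
# print(" ".join(pod5_convert_command(dirs, "Isolate_1.pod5")))
# print(" ".join(dorado_command("hac", "Isolate_1.pod5", 10)))
```

Including the fail directory here is what makes the explicit --min-qscore matter, since otherwise the low-quality reads end up in the new basecalls unfiltered.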

ADD REPLY • link 12 months ago by GenoMax 152k