Question

Estimate error rate and choose the assembly option for the Flye assembler

4 weeks ago

Umer ▴ 160

Hi reader,

I have some nanopore-sequenced fungal samples, sequenced with an R10.4 flow cell and the LSK114 kit.

Some initial information:

Median length (average): 8055 bp
N50 (average): 11109 bp
Median quality (average): 16.7

Question 1: I am not sure about the term "error rate" in sequencing. Can you guide me, or point me to where I can read about it and how to calculate it for my data sets?

The sequencing company already basecalled the data with Dorado and provided FASTQ files split into pass/fail. I want to generate assemblies, so I am using the pass FASTQ files for downstream analysis.

I have used Porechop to remove adapters and Chopper to remove reads with Q<10 and length <2000 bp.

I am using Flye v2.9.5-b1801 for genome assembly. I have a question about selecting the read-type option.

Question 2: Out of these two, which one is more suitable for my data?

--nano-raw path [path ...]
                        ONT regular reads, pre-Guppy5 (<20% error)
--nano-hq path [path ...]
                        ONT high-quality reads: Guppy5+ SUP or Q20 (<5% error)

Also, is the --scaffold option in the Flye assembler recommended?

Thank you.

flye assembly genome error-rate nanopore • 470 views

ADD COMMENT • link 28 days ago by Umer ▴ 160

score 0 · Answer 1 · 2024-11-22

4 weeks ago

Istvan Albert 102k

Error rate is typically reported as a Phred-like quality score:

https://en.wikipedia.org/wiki/Phred_quality_score

where the quality score E is plugged into the formula

Error probability = 10^(-E/10)

So, for example, E=20 gives 10^-2, i.e. P = 1/100 = 0.01. That is a 1% error rate: one error per hundred basecalls.

Sometimes people call it E or Q, sometimes it is shown as P (probability) as a fraction, and sometimes it is expressed as a percent, so it can be a bit confusing.
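To make the conversion concrete, here is a minimal Python sketch of the formula above, together with its inverse. The function names are made up for illustration and are not part of any library. It prints the error probability for Q10, for the median quality of 16.7 quoted in the question, and for Q20.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-like quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: error probability back to a quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # 16.7 is the median quality reported in the question.
    for q in (10, 16.7, 20):
        p = phred_to_error_prob(q)
        print(f"Q{q}: P = {p:.4f} ({p * 100:.2f}% error)")
```

With these numbers, Q16.7 works out to roughly a 2.1% error rate, which falls under the "<5% error" threshold listed for --nano-hq in the Flye help text.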

ADD COMMENT • link 4 weeks ago by Istvan Albert 102k

Thank you for the clarification.

Based on my raw data I get error rates around 2–2.2%. Should I use --nano-hq or --nano-raw?

What do you suggest based on your experience?
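For anyone wanting to reproduce a per-read error estimate like this, a rough sketch follows. It assumes the FASTQ quality strings use the standard Phred+33 encoding, and the helper names are hypothetical. Note that it averages the error probabilities rather than the Q values themselves, since averaging Q scores directly gives an overly optimistic number.

```python
import math


def mean_error_rate(quality_string: str, offset: int = 33) -> float:
    """Mean per-base error probability of one read (assumes Phred+33)."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return sum(probs) / len(probs)


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Quality score corresponding to the mean error probability."""
    return -10 * math.log10(mean_error_rate(quality_string, offset))


if __name__ == "__main__":
    # Toy quality string: '5' encodes Q20 and 'I' encodes Q40 in Phred+33.
    qual = "5555IIII"
    print(f"mean error rate: {mean_error_rate(qual):.5f}")
    print(f"mean read quality: Q{mean_read_quality(qual):.1f}")
```

In the toy example the arithmetic mean of the Q values would be 30, while the probability-based mean quality comes out near Q23, because the lower-quality bases dominate the error rate.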

ADD REPLY • link 4 weeks ago by Umer ▴ 160

If the data has been basecalled with Dorado using the super accuracy (SUP) or high accuracy (HAC) models, then use --nano-hq.

ADD REPLY • link 4 weeks ago by GenoMax 148k

Hi. What if I don't know which Dorado model was used? Is there any way to find that out from the fast5 or FASTQ files, or from the summary file created for each sample?

ADD REPLY • link 4 weeks ago by Umer ▴ 160

The FASTQ file header should contain the model used for basecalling. I see that in the rebasecalled files I work with. The Q-score distribution should be better than Q20 (on average) if the data is high or super accuracy.

ADD REPLY • link 4 weeks ago by GenoMax 148k

Thank you. I found it to be basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.2.0, but my Q-score is Q16 on average. I have already generated assemblies using --nano-raw and polished them, but now I am confused about whether I should have used --nano-hq.
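For reference, the model tag quoted above can be pulled out of a FASTQ header line with a few lines of Python. This is only a sketch: the example header is hypothetical (the read ID and runid field are invented), and the assumption that the accuracy tier appears as `_fast@`, `_hac@`, or `_sup@` in the model name is based on the single model string shown here.

```python
import re
from typing import Optional

# Tag name taken from the header value quoted in this thread.
MODEL_TAG = re.compile(r"basecall_model_version_id=(\S+)")


def basecall_model(header: str) -> Optional[str]:
    """Return the basecall model string from a FASTQ header, if present."""
    match = MODEL_TAG.search(header)
    return match.group(1) if match else None


def model_accuracy_tier(model: str) -> Optional[str]:
    """Guess the accuracy tier (fast/hac/sup) from the model name (assumed naming)."""
    match = re.search(r"_(fast|hac|sup)@", model)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical header: only the model tag comes from the thread.
    header = ("@read1 runid=abc "
              "basecall_model_version_id=dna_r10.4.1_e8.2_400bps_hac@v4.2.0")
    model = basecall_model(header)
    print(model)
    print(model_accuracy_tier(model))
```

Here the model name indicates a HAC model, which matches the case described earlier in the thread for using --nano-hq.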

ADD REPLY • link 28 days ago by Umer ▴ 160