Question

Attempts to demultiplex long reads from .pod5 result in unclassified reads

4 months ago

Placeholder@12654926 • 0

I would appreciate any advice on the following. I have been trying to demultiplex long-read data with Dorado. My input is a set of .pod5 files, and the first part of my workflow runs Dorado's basecaller and demux commands, as shown below:

dorado basecaller --emit-moves hac,5mCG_5hmCG,6mA --recursive --reference ${REFERENCE} ${INPUT} > calls3.bam -x "cpu"
dorado demux --output-dir ${OUTPUT2} --no-classify ${OUTPUT}

I previously had no issues basecalling and subsequently processing long read data with the basecaller command above. However, the commands above produce only a single .bam file of unclassified reads in the ${OUTPUT2} directory. I have further verified using

dorado summary ${OUTPUT} > summary.tsv

that my reads are all unclassified. A few rows from summary.tsv are shown below. I am stumped as to why this is the case. I am working under the assumption that these files have appropriate barcoding for at least 20% of reads (and even if trimming in the basecaller affects the barcodes, I would still expect at least some classified reads). Does anyone have suggestions for changes to the basecaller command I'm using?

filename    read_id run_id  channel mux start_time  duration    template_start  template_duration   sequence_length_template    mean_qscore_template    barcode alignment_genome    alignment_genome_start  alignment_genome_end    alignment_strand_start  alignment_strand_end    alignment_direction alignment_length    alignment_num_aligned   alignment_num_correct   alignment_num_insertions    alignment_num_deletions alignment_num_substitutions alignment_mapq  alignment_strand_coverage   alignment_identity  alignment_accuracy  alignment_bed_hits

second.pod5 556e1e16-cb98-465e-b4a3-8198eedbe918    09e9198614966972d6d088f7f711dd5f942012d7    109 1   3875.42 1.1782  3875.42 1.1762  80  4.02555 unclassified    *   -1  -1  -1  -1  *   0   0   0   0   0   0   0   0   0   0   0

second.pod5 85209b06-8601-4725-9fe2-b372bfd33053    09e9198614966972d6d088f7f711dd5f942012d7    277 3   3788.21 1.4804  3788.38 1.3092  61  3   unclassified    *   -1  -1  -1  -1  *   0   0   0   0   0   0   0   0   0   0   0

second.pod5 beb587cf-5294-4948-b361-f809f9524fca    09e9198614966972d6d088f7f711dd5f942012d7    389 2   3749.87 0.6752  3749.99 0.5544  213 16.948  unclassified    chr16   26499318    26499489    40  209 +   171 169 169 0   2   0   60  0.793427    1   0.988304    0

Thank you.

dorado sequencing Long-read demultiplex • 533 views

ADD COMMENT • link updated 4 months ago by GenoMax 152k • written 4 months ago by Placeholder@12654926 • 0

I previously had no issues basecalling and subsequently processing long read data

Using the exact same command? BTW, what is -x "cpu" doing at the end of the first command, after the redirect? One would normally be able to use dorado basecaller ... | dorado demux .... Have you upgraded your version of dorado recently? Sometimes programmers make major changes to options, which may not be apparent until you look at the changelog or the in-line help.
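Another thing worth checking: as far as I understand the dorado docs, --no-classify tells dorado demux to skip barcode classification and split only on barcode tags already present in the BAM. If the basecaller was run without --kit-name, there would be no such tags, and everything would land in a single unclassified file. A minimal sketch of the two-step form, assuming the barcoding kit is known. The function names are made up for illustration, and the kit name is whatever kit was actually used in the library prep:

```shell
# Sketch only: the kit name passed in is a placeholder for whichever
# barcoding kit was actually used in the library prep.

# Basecall and classify barcodes in one pass.
# Arguments: model, pod5 directory, barcoding kit name, output BAM path.
basecall_with_barcodes() {
    dorado basecaller "$1" "$2" --recursive --kit-name "$3" > "$4"
}

# Split an already-classified BAM into one file per barcode.
# Arguments: classified BAM, output directory.
split_by_barcode() {
    dorado demux --output-dir "$2" --no-classify "$1"
}
```

Usage would look like basecall_with_barcodes hac pod5_dir SQK-NBD114-24 calls.bam followed by split_by_barcode calls.bam demux_out, where SQK-NBD114-24 is only an example kit name. Option names can change between dorado releases, so check dorado basecaller --help and dorado demux --help for your version.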

I am working under the assumption that these files have appropriate barcoding for at least 20% of reads

That could be wrong.
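A quick way to put a number on that assumption, once the reads have been through barcode classification, is to tally the barcode column of summary.tsv. A minimal sketch, assuming the file is tab-separated with a header row like the one posted; count_barcodes is just a helper name I made up:

```shell
# Tally the values in the "barcode" column of a dorado summary TSV.
# The column is located by header name, so its position does not matter.
# Prints one "value<TAB>count" line per distinct barcode value, sorted.
count_barcodes() {
    awk -F'\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "barcode") col = i; next }
        col && NF { n[$col]++ }
        END { for (b in n) printf "%s\t%d\n", b, n[b] }
    ' "$1" | sort
}
```

Running count_barcodes summary.tsv prints each barcode value with its read count. If the only line is unclassified, no read was assigned to a barcode.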

ADD REPLY • link 4 months ago by GenoMax 152k