Question

Nanopore long-read sequencing doubts and problems

0

Entering edit mode

5 weeks ago

ferran.llobet ▴ 10

Hi everyone,

I have just started the bioinformatic analysis of my first ever ONT long-read sample and I'm quite lost, so I have several questions.

Sequencing output

To put you in context, we're interested in the sequence but also in methylation detection. Basecalling was done by the sequencing department (model dna_r10.4.1_e8.2_400bps_sup@v4.3). I recieved two type of files:

BAM files
FASTQ files

Here comes question #1. The FASTQ is the read sequence and the BAM is the position of modified bases taking the sequencing reads as the reference sequence, right? So, for now, to only map the reads to the reference genome, I can ignore ghe BAMs, right?

Sequencing QC

To my understanding, N50, Gb output and base quality are the most important parameters. For this QC I used NanoPlot which I read is the standard and serves as an anologous to FastQC. I am content enough with the seuencing output.

Mapping

Then, I ran the mapping against the reference genome using minimap2.

There I encountered my first problem, when I use the -c/-a options (which do base-level mapping) I get NO mapping reads. Otherwise, running minimap2 with the deafult approximate mapping (no -c/-a options) I do get many records in the PAF file, but I don't know if I can trust them.

Mapping QC

Here I couldn't find any guides/pipelines/tools to do the QC of the mapping job. So my quesiton #2 is: what about mapping/alignment quality?

I could find the dotplot appears to be a commonly used way to check how much the query (our sequencing reads) matches the reference (reference genome). I undesrstood that good quality would mean a mostly diagonal line, I don't know if this is correct. Then, when I plot my data I can only see vertical lines

paf <- read.table("file.paf", fill = TRUE,
                        col.names = paste0("V",seq_len(max(count.fields("file.paf", sep = "\t")))))
ggplot(data = paf) +
  geom_segment(aes(x = V8, xend = V9, y = V3, yend = V4))

Dotplot of my sample

I also used the pafr R package and got no dotplot. When I checked the chromosome-level coverage of the genome it was OK, all the chromsomes were covered.

Sorry for the long message. Does sombody have a clue about what is happening with this data? Can I trust this data? Did I do something wrong? Any feedback/help will be much appreciated.

minmap2 long-reads NanoPlot Nanopore • 677 views

ADD COMMENT • link updated 4 weeks ago by GenoMax 150k • written 5 weeks ago by ferran.llobet ▴ 10

0

Entering edit mode

If you are not working with methylation calls you can simply use the fastq format files for all analysis. Assuming both BAM and fastq were produced using identical basecalling model.

ADD REPLY • link 4 weeks ago by GenoMax 150k

score 2 · Answer 1 · 2025-02-23

2

Entering edit mode

5 weeks ago

JC 13k

if the nanopore data was called with a recent version of dorado, you can have methylation calls in your BAM file (MM and ML fields), you can check that with samtools view <your bam> | more and check if those fields exist. If there are there, you can use modkit to generate a BED file with methylation calls.

ADD COMMENT • link 5 weeks ago by JC 13k

0

Entering edit mode

Hi JC,

Thank you for your answer. Yes, I do see these fields in the BAM files that I have.

Then, just another quick question. So the BAM isn't mapped to any reference, its just a way of having sequence data + methylation status in the same file (which I understand is unsupported in FASTQ format), right?

ADD REPLY • link 5 weeks ago by ferran.llobet ▴ 10

1

Entering edit mode

more or less yes

The BAM that is produced is just an alternative for the FASTQ files, and will become the new standard for ONT output files (because they can contain much more sorts of data than the FASTQ files, for instance methylation info as stated elsewhere here) .

ADD REPLY • link 5 weeks ago by lieven.sterck 15k

1

Entering edit mode

That depends if the base call was done with a reference genome, so, the BAM could be with or without alignments.

ADD REPLY • link 5 weeks ago by JC 13k

score 1 · Answer 2 · 2025-02-23

I only have a little experience with this data type.

Here comes question #1. The FASTQ is the read sequence and the BAM is the position of modified bases taking the sequencing reads as the reference sequence, right? So, for now, to only map the reads to the reference genome, I can ignore ghe BAMs, right?

Yes, you can proceed with mapping using just FASTQs. However, in my understanding, the model you listed does not include modified base calling, so the bam may not include that data either. You may want to follow-up with your sequencing center.

There I encountered my first problem, when I use the -c/-a options (which do base-level mapping) I get NO mapping reads.

I'm not sure what could cause this. Maybe the reads need some trimming? I've used Dorado base caller and it says it removes adapter/primer by default, and I'm not sure they would be significant enough to cause a problem with mapping.

Here I couldn't find any guides/pipelines/tools to do the QC of the mapping job. So my quesiton #2 is: what about mapping/alignment quality?

The principles should be the same as any other alignment, maybe start with samtools? I would recommend checking the reads manually though. They are going to be noisier than Illumina reads, but I look for plenty of long mapped reads, and I would also consider good coverage as a sign of high quality data. Ultimately, you are sequencing the genome directly. So if the reads themselves pass quality filters, then they should be good to go.

lieven.sterck · Answer 3 · 2025-02-24

I use this command to map ONT reads - which as JC says have previously had the methylation called during basecalling with Dorado.

Note the -a to get sam format
Then use samtools, cramino et al to check your alignments
The -R is for reads groups which is not necessary for your initial applications
params are just parameter variables from my nextflow.config
you can change the -x lr:hq to map-ont I believe if your data are lower quality.
-y is important since it will transfer the MM and ML tags from your fastq (if present, see the Dorado comments) to your bam
After mapping use Modkit.

minimap2 -y -x lr:hq -a -R "@RG\tID:$params.genotype\tSM:$params.sample\tPL:$params.platform\tPU:$params.platform\tLB:$params.library" --split-prefix ${prefix} -t 16 -o ${prefix}.sam $params.fasta $fastq