Hi everyone,
I have just started the bioinformatic analysis of my first ever ONT long-read sample and I'm quite lost, so I have several questions.
Sequencing output
To put you in context, we're interested in the sequence but also in methylation detection. Basecalling was done by the sequencing department (model dna_r10.4.1_e8.2_400bps_sup@v4.3). I recieved two type of files:
- BAM files
- FASTQ files
Here comes question #1. The FASTQ is the read sequence and the BAM is the position of modified bases taking the sequencing reads as the reference sequence, right? So, for now, to only map the reads to the reference genome, I can ignore ghe BAMs, right?
Sequencing QC
To my understanding, N50, Gb output and base quality are the most important parameters. For this QC I used NanoPlot
which I read is the standard and serves as an anologous to FastQC
. I am content enough with the seuencing output.
Mapping
Then, I ran the mapping against the reference genome using minimap2
.
There I encountered my first problem, when I use the -c/-a
options (which do base-level mapping) I get NO mapping reads. Otherwise, running minimap2
with the deafult approximate mapping (no -c/-a
options) I do get many records in the PAF file, but I don't know if I can trust them.
Mapping QC
Here I couldn't find any guides/pipelines/tools to do the QC of the mapping job. So my quesiton #2 is: what about mapping/alignment quality?
I could find the dotplot appears to be a commonly used way to check how much the query (our sequencing reads) matches the reference (reference genome). I undesrstood that good quality would mean a mostly diagonal line, I don't know if this is correct. Then, when I plot my data I can only see vertical lines
paf <- read.table("file.paf", fill = TRUE,
col.names = paste0("V",seq_len(max(count.fields("file.paf", sep = "\t")))))
ggplot(data = paf) +
geom_segment(aes(x = V8, xend = V9, y = V3, yend = V4))
I also used the pafr
R package and got no dotplot. When I checked the chromosome-level coverage of the genome it was OK, all the chromsomes were covered.
Sorry for the long message. Does sombody have a clue about what is happening with this data? Can I trust this data? Did I do something wrong? Any feedback/help will be much appreciated.
If you are not working with methylation calls you can simply use the fastq format files for all analysis. Assuming both BAM and fastq were produced using identical basecalling model.