Question

Analysis of Nanopore cDNA sequencing data

1

Entering edit mode

2.5 years ago

Nico80 ▴ 80

We are testing cDNA sequencing on an ONT MinION, and I am trying to get my head around the different data analysis pipelines.

We did the analysis directly on the MinION (basecalling with Guppy + alignment with minimap2) using the mouse genome ENSEMBL reference found here.

The result is a lot of small fastq + bam files. I joined all the bam files using samtools and got a single bam file which I can display in IGV.

Here is an example of what I see which, at least for this particular gene, seems to match the RefSeq annotation relatively nicely.

screenshot from IGV for the aligned data from the MinION

As a different strategy, I tried to merge all fastq.gz files from the MinION using

cat *.fastq.gz > merged.fastq.gz

then align them to the same genome using minimap2 after filtering for quality >8

zcat merged.fastq.gz | NanoFilt -q 8 --headcrop 50 | gzip > filtered-reads.fastq.gz

./minimap2 -ax splice Mus_musculus.GRCm39.dna.primary_assembly.fa filtered-reads.fastq.gz | samtools sort -o merged.bam

If I now look at this in IGV I see this (bottom is the bam from the MinION as in the previous image, top is my mapping)

screenshot from IGV comparing the aligned data from the MinION with the manually aligned one

So, the alignment is completely different, and there are a lot of mismatched bases...

Clearly I am doing something wrong, but I never used minimap2 before and I find the documentation sometimes a bit cryptic.

So, here are a few questions I hope you can help me answer:

Am I correct in using that reference genome or should I use something else (that is what I normally use for aligning short reads with STAR)
Why are the two approaches giving such different results?
Even in the analysis from the MinION, I can see a lot of reads aligning outside of exons; how do I interpret that?

alignment nanopore sequencing minimap2 • 3.4k views

ADD COMMENT • link 2.4 years ago by Nico80 ▴ 80

2

Entering edit mode

Using the top level assembly should be fine.

then align them to the same genome using minimap2 after filtering for quality >8

Not sure how you did this but clearly you removed some data that was originally present. So you are no longer doing an identical comparison. How much data did you lose?

fastq_pass (and fastq_fail) folder will have multiple fastq files (ONT splits the data by a certain number of reads) and one can simply cat these together as you feed them into an alignment run. It sounds like you made individual BAM files from individual fastq files and then merged them? You should get identical results since you are essentially working with the same data.

I can see a lot of reads aligning outside of exons; how do I interpret that?

This may be related to your prep. Reads may simply be aligning due to homology? Are you seeing large pileups? If you are then that may need a separate explanation.

ADD REPLY • link 2.5 years ago by GenoMax 151k

0

Entering edit mode

Hi GenoMax, thanks for your answer. To clarify a bit more:

Not sure how you did this but clearly you removed some data that was originally present. So you are no longer doing an identical comparison. How much data did you lose?

Fair point. I did this using NanoFilt (see code above). However I would still expect to see similar alignment, just fewer reads. In any case I have since tried to redo the alignment without filtering and the same happens...

fastq_pass (and fastq_fail) folder will have multiple fastq files (ONT splits the data by a certain number of reads) and one can simply cat these together as you feed them into an alignment run. It sounds like you made individual BAM files from individual fastq files and then merged them? You should get identical results since you are essentially working with the same data.

Yes I either 1) took all of the bam files generated by the MinION, and merge them or 2) take all of the FASTQ files in the fastq_pass folder, join them and align them. I wasn't expecting 100% match, between the two methods, but here we're more close to 0%!

This may be related to your prep. Reads may simply be aligning due to homology? Are you seeing large pileups?

Not quite sure how to check that, could you elaborate on this point?

ADD REPLY • link 2.5 years ago by Nico80 ▴ 80

1

Entering edit mode

hi, it looks like you use two different references, one (bottom) corresponds to IGV, while the top is different.
could you do simple alignment (without trimming)

./minimap2 -ax splice Mus_musculus.GRCm39.dna.primary_assembly.fa guppy_outdir/*.fastq.gz | samtools sort --write-index -o guppy_outdir.bam

and make sure you use GRCm39 in IGV?

ADD REPLY • link 2.5 years ago by Leszek 4.2k

0

Entering edit mode

Hi, thanks for your answer.

I have tried pretty much any combination of reference/settings/trimming (or not) that I can think of.

I agree that might be an issue of visualization in IGV, but still when I try to count the transcripts (e.g. using htseq-count or summarizeOverlaps) I see a lot (>60%) of unaligned reads so something's amiss.

I will continue to investigate...

ADD REPLY • link 2.4 years ago by Nico80 ▴ 80