Question

Help aligning Nanopore Long Read RNA seqs

0

Entering edit mode

10 months ago

sewaw37819 • 0

I am new to the field of bioinformatics and attempting to map various nanopore long read rna samples with the GENCODE hg38 as the reference genome and long non-coding RNA transcript sequences as the annotation, using minimap2 -a -ax splice -k14 -uf --junc-bed [annotation] [reference] > [output].sam. I then convert the .sam to .bam and index it, and run it through htseq-count, but 98% of reads are listed as __not_aligned (though there IS some coverage, but mostly on whole chromosomes). I have used several variations of this command and am still getting a very high amount of unaligned reads. Is there something I am doing wrong at the mapping step or should I investigate the quality of the data first? Sorry if these are obvious answers, I am again new to this field.

rna-seq long-read minimap2 • 1.5k views

ADD COMMENT • link updated 9 months ago by GenoMax 150k • written 10 months ago by sewaw37819 • 0

0

Entering edit mode

What percentage of your library do you expect to be "long-non-coding transcripts"?

(though there IS some coverage, but mostly on whole chromosomes) What does this mean? Some coverage over your annotations? Or genome wide?

Does minimap2 give you a summary of the number of mapped reads? If not you can also check the bam file for how many reads are mapped (samtools view can help with this by feeding -f/-F with flags: https://broadinstitute.github.io/picard/explain-flags.html))

ADD REPLY • link 10 months ago by rfran010 ★ 1.4k

0

Entering edit mode

What percentage of your library do you expect to be "long-non-coding transcripts"?

A higher percentage than what we are seeing right now, but we are trying to figure that out.

What does this mean? Some coverage over your annotations? Or genome wide?

This is samtools coverage:

#rname  startpos        endpos  numreads        covbases        coverage        meandepth       meanbaseq       meanmapq
chr5    1       181538259       1       405     0.000223093     2.23093e-06     34.5    1
chr6    1       170805979       2       1029    0.000602438     6.02438e-06     32      36
chr9    1       138394717       1       552     0.000398859     3.98859e-06     33.9    60
chr10   1       133797422       1       509     0.000380426     3.80426e-06     37.2    49
chr11   1       135086622       1       273     0.000202093     2.02093e-06     33.5    60
chr17   1       83257441        1       700     0.000840766     8.40766e-06     33.9    60
chr21   1       46709983        2       873     0.00186898      1.86898e-05     29.8    0
ML143377.1      1       519485  1       598     0.115114        0.00115114      29.7    0
KI270733.1      1       179772  1       511     0.284249        0.00284249      31      0

how many reads are mapped

And samtools flagstat:

4031 + 0 in total (QC-passed reads + QC-failed reads)
4000 + 0 primary
31 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
42 + 0 mapped (1.04% : N/A)
11 + 0 primary mapped (0.27% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I get the exact same coverage when either using long non-coding rna annotation or the second comprehensive gene annotation.

ADD REPLY • link 10 months ago by sewaw37819 • 0

1

Entering edit mode

That is very few reads. Are you sure your run was successful?

ADD REPLY • link 10 months ago by WouterDeCoster 47k

0

Entering edit mode

Can you elaborate what you mean by a successful run? The files do correctly generate without crashing.

ADD REPLY • link 9 months ago by sewaw37819 • 0

0

Entering edit mode

The experiment seems to have produced very few reads, making me think there is something wrong with the library or flow cell. Which could also mean that those reads that did get generated are garbage.

ADD REPLY • link 9 months ago by WouterDeCoster 47k

1

Entering edit mode

Thanks! I think you might be misinterpreting your coverage output. I believe that just indicates the chromosome and how many reads map to it, and how many base-pairs of the chromosome are covered, so the whole chromosome itself is not covered. (hopefully I interpret your original meaning correctly).

So it looks like the few reads you have happen to map to those chromosomes. With about one or two reads covering on average 500 bp (probably the length of the fragments is around 500bp?)

Here's an example with RNA-seq short reads (mouse)

#rname  startpos        endpos  numreads        covbases        coverage        meandepth       meanbaseq       meanmapq

chr1    1       195471971       8780440 32195618        16.4707 6.3698  36.2    196

Definitely you have an issue with the quality of the data. Whether the library didn't work or the cell was corrupted somehow. If it was a nanopore problem, the QC on that should let you know.

Usually you confirm the library size and quantity is correct before sequencing, so maybe there's a disconnect when submitting libraries. Library contamination (e.g. purifying RNA from bacteria instead of human) could lead to low mapping, but you have in general low data output. Maybe wrong barcode assignment? or the library was vastly underloaded (although this wouldn't explain the very low mapping rate, meaning there's not a lot of human in the library).

I'd be curious if the unmapped reads are still good length and quality. (this goes back to the nanopore QC) I'd be curious if they map to anything else. Then, see if there could be barcode misassignment. Otherwise, if possible I would review the benchtop QC.

ADD REPLY • link 10 months ago by rfran010 ★ 1.4k

0

Entering edit mode

I'd be curious if the unmapped reads are still good length and quality.

Not sure if this is what you are asking, but running htseq-count on the indexed bam file returns that of the 4031 reads, only 5 are "__too_low_aQual", 1 is ambiguous, and 3989 are "__not_aligned".

As for QC and sequencing, I will have to get back to you on that because other members of my team did it. However, running the fastq files that I was provided through FastQC flags 0 sequences as low quality out of 585,455.

ADD REPLY • link 9 months ago by sewaw37819 • 0

0

Entering edit mode

Sorry if I'm confused/missed something. Why do you have only 4000 reads in the bam, but you mention 5854555 reads here?

ADD REPLY • link 9 months ago by rfran010 ★ 1.4k

0

Entering edit mode

I am running minimap2 with the fasta files of the individual samples, while the qc reads are all of them combined.

ADD REPLY • link 9 months ago by sewaw37819 • 0