I am new to the field of bioinformatics and attempting to map various nanopore long read rna samples with the GENCODE hg38 as the reference genome and long non-coding RNA transcript sequences as the annotation, using minimap2 -a -ax splice -k14 -uf --junc-bed [annotation] [reference] > [output].sam
. I then convert the .sam to .bam and index it, and run it through htseq-count, but 98% of reads are listed as __not_aligned
(though there IS some coverage, but mostly on whole chromosomes). I have used several variations of this command and am still getting a very high amount of unaligned reads. Is there something I am doing wrong at the mapping step or should I investigate the quality of the data first? Sorry if these are obvious answers, I am again new to this field.
What percentage of your library do you expect to be "long-non-coding transcripts"?
Does minimap2 give you a summary of the number of mapped reads? If not you can also check the bam file for how many reads are mapped (samtools view can help with this by feeding -f/-F with flags: https://broadinstitute.github.io/picard/explain-flags.html))
A higher percentage than what we are seeing right now, but we are trying to figure that out.
This is
samtools coverage
:And
samtools flagstat
:I get the exact same coverage when either using long non-coding rna annotation or the second comprehensive gene annotation.
That is very few reads. Are you sure your run was successful?
Can you elaborate what you mean by a successful run? The files do correctly generate without crashing.
The experiment seems to have produced very few reads, making me think there is something wrong with the library or flow cell. Which could also mean that those reads that did get generated are garbage.
Thanks! I think you might be misinterpreting your coverage output. I believe that just indicates the chromosome and how many reads map to it, and how many base-pairs of the chromosome are covered, so the whole chromosome itself is not covered. (hopefully I interpret your original meaning correctly).
So it looks like the few reads you have happen to map to those chromosomes. With about one or two reads covering on average 500 bp (probably the length of the fragments is around 500bp?)
Here's an example with RNA-seq short reads (mouse)
Definitely you have an issue with the quality of the data. Whether the library didn't work or the cell was corrupted somehow. If it was a nanopore problem, the QC on that should let you know.
Usually you confirm the library size and quantity is correct before sequencing, so maybe there's a disconnect when submitting libraries. Library contamination (e.g. purifying RNA from bacteria instead of human) could lead to low mapping, but you have in general low data output. Maybe wrong barcode assignment? or the library was vastly underloaded (although this wouldn't explain the very low mapping rate, meaning there's not a lot of human in the library).
I'd be curious if the unmapped reads are still good length and quality. (this goes back to the nanopore QC) I'd be curious if they map to anything else. Then, see if there could be barcode misassignment. Otherwise, if possible I would review the benchtop QC.
Not sure if this is what you are asking, but running htseq-count on the indexed bam file returns that of the 4031 reads, only 5 are "__too_low_aQual", 1 is ambiguous, and 3989 are "__not_aligned".
As for QC and sequencing, I will have to get back to you on that because other members of my team did it. However, running the fastq files that I was provided through FastQC flags 0 sequences as low quality out of 585,455.
Sorry if I'm confused/missed something. Why do you have only 4000 reads in the bam, but you mention 5854555 reads here?
I am running minimap2 with the fasta files of the individual samples, while the qc reads are all of them combined.