Question

Unaligned rRNA in RNA-seq of BALB/cJ mice

4 months ago

karlensberg • 0

I'm relatively new to bioinformatics. I have been analyzing bulk RNA sequencing data from my lab and am having an issue with my alignment using STAR. For reference, the samples are tumors from mice with a BALB/cJ background. Here's the command I'm running:

STAR \
  --runThreadN 5 \
  --runMode alignReads \
  --outSAMtype BAM SortedByCoordinate \
  --genomeDir balbMHV68refKE \
  --readFilesIn AC05trimmed.R1.fastq.gz AC05trimmed.R2.fastq.gz \
  --readFilesCommand gunzip -c \
  --outSAMunmapped Within KeepPairs

And here is the final output log:

                                 Started job on |   Sep 02 11:51:28
                             Started mapping on |   Sep 02 11:52:24
                                    Finished on |   Sep 02 13:56:44
       Mapping speed, Million of reads per hour |   56.55

                          Number of input reads |   117191870
                      Average input read length |   284
                                    UNIQUE READS:
                   Uniquely mapped reads number |   64004518
                        Uniquely mapped reads % |   54.62%
                          Average mapped length |   262.87
                       Number of splices: Total |   11243286
            Number of splices: Annotated (sjdb) |   10046054
                       Number of splices: GT/AG |   10585630
                       Number of splices: GC/AG |   78986
                       Number of splices: AT/AC |   12783
               Number of splices: Non-canonical |   565887
                      Mismatch rate per base, % |   1.68%
                         Deletion rate per base |   0.03%
                        Deletion average length |   2.73
                        Insertion rate per base |   0.20%
                       Insertion average length |   3.12
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   5889851
             % of reads mapped to multiple loci |   5.03%
        Number of reads mapped to too many loci |   3448
             % of reads mapped to too many loci |   0.00%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   47277998
                 % of reads unmapped: too short |   40.34%
                Number of reads unmapped: other |   16055
                     % of reads unmapped: other |   0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

I'm not sure why such a large percentage of reads are unmapped as "too short". I pulled out all of the unmapped reads and BLASTed some of them. All of the ones I looked at seemed to be some kind of rRNA. After some reading, I found out that reference genomes can include non-chromosomal scaffolds/contigs, and I'm wondering whether those are missing because I'm using a BALB/cJ reference genome (which to my knowledge is less developed than the B6 reference). For anyone who has done alignment with BALB/cJ samples: what reference genome do you use, and have you had this issue before? Any advice would be incredibly helpful!

reference-genome RNA-seq • 506 views

updated 4 months ago by dsull ★ 7.2k • written 4 months ago by karlensberg • 0

score 1 · Answer 1 · 2024-09-05

4 months ago

dsull ★ 7.2k

55% uniquely mapped reads isn't bad at all.

For mice, I always just use the standard B6 reference; the polymorphisms are too subtle (relatively speaking) to throw off most alignments. The other references aren't as high quality, which could cause fewer reads to get aligned.

In any case, since the unmapped reads are mostly rRNA, the likely explanation is simply that your library has high ribosomal content. rRNAs are highly abundant in tissue and can account for a large share of reads.
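
If you want to spot-check more than a handful of unmapped reads, one option is to convert the first few records of the unmapped FASTQ into FASTA and BLAST that batch. The following is only a minimal sketch: the function name `fastq_head_to_fasta` and the file name `unmapped_R1.fastq` are hypothetical, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# Hypothetical helper (not part of STAR or BLAST): print the first N records
# of a FASTQ file as FASTA. Assumes uncompressed input, 4 lines per record.
#   $1 = number of records to keep
#   $2 = path to the FASTQ file
fastq_head_to_fasta() {
  awk -v n="$1" '
    # Line 1 of each record is the header: stop once N records were emitted,
    # otherwise print the read ID (first field, without the leading "@").
    NR % 4 == 1 { if (++seen > n) exit; print ">" substr($1, 2) }
    # Line 2 of each record is the sequence.
    NR % 4 == 2 { print }
  ' "$2"
}

# Example call (file names are placeholders):
#   fastq_head_to_fasta 100 unmapped_R1.fastq > unmapped_sample.fasta
```

Taking the first N records is not a random sample, so treat the result as a rough indication of what dominates the unmapped fraction rather than a precise estimate.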

4 months ago by dsull ★ 7.2k

Thank you for the help! I have a couple of follow-up questions:

Is there a range of % mapped reads that you typically encounter and consider acceptable? I was under the impression that anything below 80% is lower than you should expect to see.

Regarding the B6 reference, have you ever compared the results from a B6 reference against those from the proper background reference to see whether there are any major differences? Is it standard in the field to opt for the GRCm39/GRCm38 (mm39/mm10) assemblies over the lower-quality but proper background reference?

Finally, when we did the library prep, we used a ribosomal depletion kit (specifically this one: https://www.neb.com/en-us/products/e7400-nebnext-rrna-depletion-kit-v2-human-mouse-rat), so I was surprised to see that every unmapped read I checked was ribosomal. Are these kits less efficient in practice than they claim?

Again thank you so much for your response, and sorry for all of these questions!!

4 months ago by karlensberg • 0

There is no “right” answer for what’s acceptable. What matters is figuring out why some reads aren’t getting mapped, and there are many possible reasons. If I have a 30% mapping rate because of 70% ribo content in my reads but my downstream analysis is successful, no problem. If I have a 30% mapping rate because I ran a human/mouse mixing experiment and put only the mouse genome in my reference, then I should go back and fix that. If I have a 30% mapping rate because almost all of the reads are adapters, I might want to check that I didn’t just select for tiny fragments in my wet lab sample prep. And so on.
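
As a starting point for the “why”, the unmapped categories in STAR’s Log.final.out can be tabulated against the input read count. Below is a rough sketch, assuming the log uses the `label | value` layout shown in the question; the function name `star_unmapped_breakdown` is made up for illustration. Also note that, as far as I know, “too short” in STAR means the aligned portion covered too small a fraction of the read (governed by `--outFilterScoreMinOverLread` and `--outFilterMatchNminOverLread`, 0.66 by default), not that the read itself is short.

```shell
# Hypothetical helper: list each "Number of reads unmapped: ..." category
# from a STAR Log.final.out as: category <TAB> count <TAB> percent of input.
#   $1 = path to Log.final.out
star_unmapped_breakdown() {
  awk -F'|' '
    # Strip leading and trailing blanks or tabs from a field.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }

    /Number of input reads/ { total = trim($2) + 0 }

    /Number of reads unmapped:/ {
      label = trim($1)
      sub(/^Number of reads unmapped: /, "", label)
      count[label] = trim($2) + 0
      order[++n] = label            # remember the order seen in the log
    }

    END {
      if (total <= 0) {
        print "no input read count found" > "/dev/stderr"
        exit 1
      }
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.2f%%\n", order[i], count[order[i]], 100 * count[order[i]] / total
    }
  ' "$1"
}

# Example call (path is a placeholder):
#   star_unmapped_breakdown Log.final.out
```

This only restates what the log already contains; deciding whether a category is a problem still requires looking at the reads themselves.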

In answer to your second question: yes, I have, and there tend not to be major differences (most differences arise in non-protein-coding regions, where the Black 6 reference is clearly better). So I just stick with Black 6.

As for the ribodepletion kits, they aren’t perfect and, as with any wet lab procedure, are prone to sample-to-sample variability. They are still good to use: the main advantage of depleting more rRNA is that you get more bang for your buck (reads cost money).

4 months ago by dsull ★ 7.2k