Question

Too many unmapped reads - STAR alignment

1

4.5 years ago

codsorre ▴ 10

Hello all,

Probably a simple fix, but we all have to start somewhere. I am trying to align reads to a Trinity-generated transcriptome using STAR and am currently troubleshooting. I ran an alignment with just one of my samples (this sample was included when the transcriptome was generated). The average input read length was 141, which intuitively should not lead to 99% of reads being "too short", as the output says. The reads were originally sequenced at 150 bp.

First, I built the index:

Slurm command = ... --wrap="STAR --runThreadN 20 --runMode genomeGenerate --genomeDir ...path_to_index --genomeFastaFiles Trinity.fasta --genomeSAindexNbases 14"
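A note on the index parameters: the STAR manual recommends scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1) for small references, and scaling `--genomeChrBinNbits` to min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))) when the reference has many sequences, as a Trinity assembly typically does. The sketch below computes both values from a FASTA file. The function name `star_index_params`, the default read length of 150, and rounding down to an integer are assumptions made for illustration, not part of STAR.

```shell
# Suggest STAR index parameters for a FASTA reference.
# Usage: star_index_params Trinity.fasta [read_length]
# Assumptions: read_length defaults to 150; values are rounded down.
star_index_params() {
  awk -v readlen="${2:-150}" '
    /^>/ { nseq++; next }                          # count reference sequences
    { gsub(/[ \t\r]/, ""); total += length($0) }   # sum sequence bases
    END {
      if (nseq == 0 || total == 0) exit 1          # nothing usable in the file
      # min(14, log2(GenomeLength)/2 - 1)
      sa = int(log(total) / log(2) / 2 - 1)
      if (sa > 14) sa = 14
      # min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength)))
      per_ref = total / nseq
      if (per_ref < readlen) per_ref = readlen
      bits = int(log(per_ref) / log(2))
      if (bits > 18) bits = 18
      printf "--genomeSAindexNbases %d --genomeChrBinNbits %d\n", sa, bits
    }' "$1"
}
```

Running `star_index_params Trinity.fasta 150` prints two flags that could be appended to the `genomeGenerate` command. Whether they differ from the defaults depends on the size and contig count of the assembly.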

Then I ran the alignment:

Slurm command = ... --wrap="STAR --readFilesCommand zcat --readFilesIn <in1> <in2> --genomeDir <.../index> --runThreadN 20 --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within"

Any thoughts? Could this be a problem with building indices or the actual alignment?

                          Number of input reads |   42852270
                      Average input read length |   141
                                    UNIQUE READS:
                   Uniquely mapped reads number |   217
                        Uniquely mapped reads % |   0.00%
                          Average mapped length |   126.74
                       Number of splices: Total |   0
            Number of splices: Annotated (sjdb) |   0
                       Number of splices: GT/AG |   0
                       Number of splices: GC/AG |   0
                       Number of splices: AT/AC |   0
               Number of splices: Non-canonical |   0
                      Mismatch rate per base, % |   7.60%
                         Deletion rate per base |   0.00%
                        Deletion average length |   0.00
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.00
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   267726
             % of reads mapped to multiple loci |   0.62%
        Number of reads mapped to too many loci |   24294
             % of reads mapped to too many loci |   0.06%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   42555398
                 % of reads unmapped: too short |   99.31%
                Number of reads unmapped: other |   4635
                     % of reads unmapped: other |   0.01%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

alignment • 4.3k views

ADD COMMENT • link updated 3.9 years ago by jdmontenegroc • 0 • written 4.5 years ago by codsorre ▴ 10

1

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

Also, add relevant tags so people can find your questions more easily. STAR and Trinity are relevant tags here, but the only tag added is alignment, which is too generic and not at all helpful. Please invest a decent amount of effort in your question.

ADD REPLY • link 4.5 years ago by Ram 44k

1

Note that when STAR says "too short" it doesn't literally mean that. It just means it didn't map. Are you totally sure this is the right reference?
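To put a number on this: per the STAR manual, `--outFilterScoreMinOverLread` and `--outFilterMatchNminOverLread` both default to 0.66 and are normalized to the read length (the sum of both mates' lengths for paired-end reads). A read whose best alignment falls below that fraction is reported as "unmapped: too short". The helper below is only a rough sketch of that arithmetic. The name `star_min_aligned_bases` is made up for illustration, and STAR's exact internal rounding may differ slightly.

```shell
# Approximate number of bases that must align for a read to pass STAR's
# default length filters.
# Usage: star_min_aligned_bases READ_LENGTH [FRACTION]
# FRACTION defaults to 0.66, the documented default of
# --outFilterMatchNminOverLread and --outFilterScoreMinOverLread.
star_min_aligned_bases() {
  awk -v len="$1" -v frac="${2:-0.66}" 'BEGIN { printf "%d\n", len * frac }'
}

# For diagnosis only, both thresholds can be relaxed on the STAR command
# line (0.3 is an arbitrary illustrative value, not a recommendation):
#   --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3
```

For a read length of 141, `star_min_aligned_bases 141` gives 93, so alignments covering fewer than roughly 93 bases would be counted as "too short". If relaxing the thresholds rescues most reads, only partial matches are being found, which points back to the reference or the reads rather than to the filter itself.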

ADD REPLY • link 4.5 years ago by swbarnes2 14k

0

Try to map with a different tool to see if you get different results. Since you're mapping to a transcriptome, you could try BWA.

ADD REPLY • link 4.5 years ago by alex.zaccaron ▴ 470

0

Did you use BUSCO to assess the transcriptome? It can show whether the Trinity assembly is reasonably complete. You could also align the other samples against the reference; that may give some ideas.
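If several samples are aligned, the key percentages from each `Log.final.out` can be pulled into one line per sample for a quick comparison. This is a minimal sketch that assumes the log layout shown in the question. The function name `star_log_summary` and the directory layout in the usage comment are hypothetical.

```shell
# Summarize the main mapping percentages from a STAR Log.final.out file.
# Usage: star_log_summary path/to/Log.final.out
star_log_summary() {
  awk -F'|' '
    # strip whitespace and the percent sign from the value column
    function clean(s) { gsub(/[ \t%]/, "", s); return s }
    /Uniquely mapped reads %/            { uniq   = clean($2) }
    /% of reads mapped to multiple loci/ { multi  = clean($2) }
    /% of reads unmapped: too short/     { tshort = clean($2) }
    END { printf "unique=%s multi=%s too_short=%s\n", uniq, multi, tshort }
  ' "$1"
}

# Example loop (assumes one output directory per sample, which is a guess):
#   for log in */Log.final.out; do
#     printf '%s\t' "$log"; star_log_summary "$log"
#   done
```

If every sample shows the same near-zero unique mapping rate, the problem is more likely in the reference or the index than in one bad library.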

ADD REPLY • link 4.5 years ago by young_bioinformatician ▴ 240

0

What have you done before aligning the reads to the reference genome?

ADD REPLY • link 3.9 years ago by DareDevil ★ 4.3k

score 0 · Answer 1 · 2021-01-22

I am pretty sure that aligning RNA-seq reads to a de novo transcriptome assembly would be much more efficient with a regular aligner like BWA or bowtie2. The transcriptome should contain little to no intronic sequence, so there is no benefit in mapping with a splice-aware aligner like STAR.

Second, I am not sure what steps you followed for the Trinity assembly. It is usually suggested to run at least one round of sequence clustering (cd-hit is a good choice) so that you keep only one or maybe two isoforms per gene. Having too many alternative isoforms in your reference will cause STAR to drop seeds because they are too repetitive, which increases the percentage of unmapped reads.

Finally, I would suggest trimming low-quality bases from your raw reads before mapping. I recently had a project where nearly 70% of the reads were unmapped because the seeds were too short; after trimming, the unmapped percentage dropped to <1%. That said, nearly 50% of all reads were discarded by trimming, so it was a crappy library to begin with.

Anyway, hope this helps. Regards
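One quick way to gauge how redundant the reference is before clustering is to count isoforms per gene from the FASTA headers. The sketch below assumes the usual Trinity naming convention, e.g. `TRINITY_DN1000_c115_g5_i1`, where the trailing `_i<N>` marks the isoform and everything before it identifies the gene. The function name `trinity_isoform_stats` is made up for illustration.

```shell
# Count transcripts, genes, and mean isoforms per gene in a Trinity FASTA.
# Usage: trinity_isoform_stats Trinity.fasta
# Assumes headers look like ">TRINITY_DN1000_c115_g5_i1 len=...".
trinity_isoform_stats() {
  awk '
    /^>/ {
      id = substr($1, 2)            # first header token without ">"
      gene = id
      sub(/_i[0-9]+$/, "", gene)    # drop the isoform suffix to get the gene ID
      if (!(gene in seen)) { seen[gene] = 1; ngenes++ }
      ntrans++
    }
    END {
      if (ngenes == 0) exit 1       # no headers found
      printf "transcripts=%d genes=%d mean_isoforms_per_gene=%.2f\n",
             ntrans, ngenes, ntrans / ngenes
    }' "$1"
}
```

A high mean number of isoforms per gene would be consistent with the multi-mapping and seed-dropping concern above, and would be a reason to cluster or to keep only one representative isoform per gene before indexing.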