Question

RNA STAR and trimming

1

Entering edit mode

11 months ago

casio ▴ 10

Hi,

I came across the analysis of RNA-seq data, and the trimming of it makes me wonder a bit. We would usually trim the first 10 - 15 base pairs given, the graphic. However, with this, RNA STAR maps only to ~60 % unique reads. If we remove the trimming of the first 10 - 15 base pairs, the unique mapped reads are above 90%! Even if we remove just 1 base pair in the trimming, the quality of the unique mapped reads drops down to 64%. Can someone explain to me why even a removal of 1 basepair changes the result that much? Given how mappers work, with k-mers and suffix arrays, I am surprised the removal of 1 basepair changes the result that much.

Also, why do we have such a non-uniform distribution at the reads for the first 10 - 15 base pairs? What makes them so special that they are actually very important?

Thanks a lot!

enter image description here

trimming STAR RNA-seq RNA • 2.9k views

ADD COMMENT • link updated 11 months ago by GenoMax 151k • written 11 months ago by casio ▴ 10

0

Entering edit mode

Could be helpful if you provide the STAR output summary (e.g. after trimming, what do these reads then get classified as?)

ADD REPLY • link 11 months ago by rfran010 ★ 1.5k

0

Entering edit mode

See my other comment, but also the full statistics: STAR output for removingno base pair with fastp (-f 0 and -F 0):

                                Started job on |    Jul 05 10:03:08
                             Started mapping on |   Jul 05 10:08:06
                                    Finished on |   Jul 05 10:15:53
       Mapping speed, Million of reads per hour |   180.85

                          Number of input reads |   23460158
                      Average input read length |   282
                                    UNIQUE READS:
                   Uniquely mapped reads number |   21643662
                        Uniquely mapped reads % |   92.26%
                          Average mapped length |   281.70
                       Number of splices: Total |   24210914
            Number of splices: Annotated (sjdb) |   23968959
                       Number of splices: GT/AG |   23972638
                       Number of splices: GC/AG |   180247
                       Number of splices: AT/AC |   21369
               Number of splices: Non-canonical |   36660
                      Mismatch rate per base, % |   0.28%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.71
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.60
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   1426413
             % of reads mapped to multiple loci |   6.08%
        Number of reads mapped to too many loci |   10796
             % of reads mapped to too many loci |   0.05%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   372783
                 % of reads unmapped: too short |   1.59%
                Number of reads unmapped: other |   6504
                     % of reads unmapped: other |   0.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

STAR output for removing one 1 base pair with fastp (-f 1 and -F 1):

                       Started job on | Jul 05 10:33:30
                             Started mapping on |   Jul 05 10:37:34
                                    Finished on |   Jul 05 10:43:59
       Mapping speed, Million of reads per hour |   219.37

                          Number of input reads |   23460525
                      Average input read length |   280
                                    UNIQUE READS:
                   Uniquely mapped reads number |   15213991
                        Uniquely mapped reads % |   64.85%
                          Average mapped length |   290.67
                       Number of splices: Total |   17862022
            Number of splices: Annotated (sjdb) |   17683626
                       Number of splices: GT/AG |   17680639
                       Number of splices: GC/AG |   136156
                       Number of splices: AT/AC |   16161
               Number of splices: Non-canonical |   29066
                      Mismatch rate per base, % |   0.34%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.72
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.62
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   948590
             % of reads mapped to multiple loci |   4.04%
        Number of reads mapped to too many loci |   8677
             % of reads mapped to too many loci |   0.04%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   7282948
                 % of reads unmapped: too short |   31.04%
                Number of reads unmapped: other |   6319
                     % of reads unmapped: other |   0.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

STAR output for removing 12 base pair with fastp (-f 12 and -F 12):

                                 Started job on |   Jul 05 10:20:34
                             Started mapping on |   Jul 05 10:35:07
                                    Finished on |   Jul 05 10:40:38
       Mapping speed, Million of reads per hour |   254.89

                          Number of input reads |   23435619
                      Average input read length |   258
                                    UNIQUE READS:
                   Uniquely mapped reads number |   12736155
                        Uniquely mapped reads % |   54.35%
                          Average mapped length |   270.76
                       Number of splices: Total |   13915258
            Number of splices: Annotated (sjdb) |   13774701
                       Number of splices: GT/AG |   13771792
                       Number of splices: GC/AG |   107736
                       Number of splices: AT/AC |   12870
               Number of splices: Non-canonical |   22860
                      Mismatch rate per base, % |   0.35%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.70
                        Insertion rate per base |   0.01%
                       Insertion average length |   1.58
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   828706
             % of reads mapped to multiple loci |   3.54%
        Number of reads mapped to too many loci |   8535
             % of reads mapped to too many loci |   0.04%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   9856325
                 % of reads unmapped: too short |   42.06%
                Number of reads unmapped: other |   5898
                     % of reads unmapped: other |   0.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Output for TopHat2 on non-removal:

Left reads:
          Input     :  23460158
           Mapped   :  21355015 (91.0% of input)
            of these:   1985724 ( 9.3%) have multiple alignments (17349 have >20)
Right reads:
          Input     :  23460158
           Mapped   :  21093269 (89.9% of input)
            of these:   1963051 ( 9.3%) have multiple alignments (17423 have >20)
90.5% overall read mapping rate.

Aligned pairs:  20084666
     of these:   1877792 ( 9.3%) have multiple alignments
                  404733 ( 2.0%) are discordant alignments
83.9% concordant pair alignment rate.

Output for TopHat2 on 12 base-pair removal:

Left reads:
          Input     :  23435619
           Mapped   :  21901558 (93.5% of input)
            of these:   2311245 (10.6%) have multiple alignments (186741 have >20)
Right reads:
          Input     :  23435619
           Mapped   :  21450134 (91.5% of input)
            of these:   2272332 (10.6%) have multiple alignments (186712 have >20)
92.5% overall read mapping rate.

Aligned pairs:  20790678
     of these:   2207501 (10.6%) have multiple alignments
                  703626 ( 3.4%) are discordant alignments
85.7% concordant pair alignment rate.

ADD REPLY • link 11 months ago by casio ▴ 10

score 1 · Answer 1 · 2024-07-10

1

Entering edit mode

11 months ago

dsull ★ 7.6k

OK, I think I have it figured out.

Looking at a few reads, the paired end reads are "dovetailing".

Let's say you have 10-bp long reads and you have already trimmed them (by 1 bp) so that they are 9-bp long. The R1 read of TGGTGAGGT and its mate R2 is CCTCACCAG (the reverse complement of R2 is CTGGTGAGG).

That means your alignment would look something like this:

 R1:  TGGTGAGGT
 R2: CTGGTGAGG

Notice the overhang and how the mates extend past each other (i.e. dovetail)? STAR will report this as unaligned.

Now, why does trimming affect these results? Well, in this case, if you hadn't trimmed off that 1 bp, the mates wouldn't extend past each other! (In a lot of cases, I'm seeing these 1 bp overhang results).

I aligned your BAM file of the STAR-unaligned reads with hisat2 and then used samtools to give me the insert size distribution. The vast majority of records had an insert size of 0 (again, this is due to what I described above).

What's the solution? You have to tell STAR to align these by using either the --alignEndsProtrude ConcordantPair option or, perhaps better yet, the --peOverlapNbasesMin option. Using these options, you can specify the allowable overhang or overlap.

In any case, the reason your reads look like this is due to a library prep issue in the lab.

ADD COMMENT • link 11 months ago by dsull ★ 7.6k

0

Entering edit mode

Good detective work. If the paired-end reads are overlapping to that extent then this must be a fairly short inset library.

ADD REPLY • link 11 months ago by GenoMax 151k

0

Entering edit mode

Wow, thank you so much!

For testing, I trimmed now on tail 75 bp away, and the unique read alignment is up to 88%, even if we trim the head 12 too. Which means it confirms your analysis.

What could be a cause for this in the library prep? A too short fragment size in respect to the read length?

ADD REPLY • link 11 months ago by casio ▴ 10

0

Entering edit mode

What could be a cause for this in the library prep? A too short fragment size in respect to the read length?

If these are FFPE samples then that is normal(ish). It could also be that the starting material was over fragmented and/or the library has not been treated to remove short fragments. Short fragments cluster preferentially.

ADD REPLY • link 11 months ago by GenoMax 151k

score 0 · Answer 2 · 2024-07-05

0

Entering edit mode

11 months ago

GenoMax 151k

why do we have such a non-uniform distribution at the reads for the first 10 - 15 base pairs?

Please see the blog post by authors of FastQC located here: https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/

If we remove the trimming of the first 10 - 15 base pairs, the unique mapped reads are above 90%!

That is odd. What do you mean by "unique mapped" reads?

ADD COMMENT • link 11 months ago by GenoMax 151k

0

Entering edit mode

I think they mean the uniquely mapped reads that is outputted in the log final at the end of a STAR run.

But that is indeed very odd behavior. I think the best way to diagnose is to literally just use samtools to view the output BAM files and investigate what reads are being mapped in one scenario but unmapped in the other scenario (and see exactly where in the read the mapping is occurring and see if it exists in the genome that was indexed).

ADD REPLY • link 11 months ago by dsull ★ 7.6k

0

Entering edit mode

Thank you for your reply. I will specify my actions a bit more:

Trimming with fastp and no hard removal of basepairs: fastp --thread 8 --report_title 'fastp report for L1_R1_001_fastq_gz.fastq.gz' -i 'L1_R1_001_fastq_gz.fastq.gz' -o first.fastq.gz -I 'L1_R2_001_fastq_gz_R2.fastq.gz' -O second.fastq.gz --detect_adapter_for_pe -f 0 -F 0
Trimming with fastp and hard removal of 12 basepairs: fastp --thread 8 --report_title 'fastp report for L1_R1_001_fastq_gz.fastq.gz' -i 'L1_R1_001_fastq_gz.fastq.gz' -o first.fastq.gz -I 'L1_R2_001_fastq_gz_R2.fastq.gz' -O second.fastq.gz --detect_adapter_for_pe -f 12 -F 12

Using RNA STAR for mapping against hg38 with the following commands: STAR --runThreadN 8 --genomeLoad NoSharedMemory --genomeDir 'hg38_index' --sjdbOverhang 100 --sjdbGTFfile hg38_knownGenes.gtf --sjdbGTFfeatureExon 'exon' --readFilesIn forward.fastq.gz reverse.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate --twopassMode None '' --quantMode - --outSAMattrIHstart 1 --outSAMattributes NH HI AS nM ch --outSAMprimaryFlag OneBestScore --outSAMmapqUnique 60 --outSAMunmapped Within --outBAMsortingThreadN 8 --outBAMsortingBinsN 50 --winAnchorMultimapNmax 50 --limitBAMsortRAM 1000000 && samtools view -b -o mapped.bam Aligned.sortedByCoord.out.bam

This leads in the first case to a unique mapped reads of RNA STAR of Uniquely mapped reads % | 92.26% and in the second case to Uniquely mapped reads % | 54.35%. If we trim just 1 base pair, we get Uniquely mapped reads % | 64.85%.

In the trimming cases, we get e.g. for removal of 12 base pairs we get also a very high reads too short rate Number of reads unmapped: too short | 9856325 % of reads unmapped: too short | 42.06%

For control, we used also TopHat2 on the no hard threshold data vs hard threshold of 12 with the command: tophat2 --num-threads 8 -r 300 --mate-std-dev=20 bowtie2_index/hg38/hg38 forward.fastq.gz reverse.fastq.gz. This leads to 83.9% concordant pair alignment rate. for the non-hard trimming case, and 85.7% concordant pair alignment rate. in case of the 12 base pair removal.

This gives the insight, that trimming the 'noise' seems to work, and improves the mapping for TopHat2. However, it does not give me any idea why RNA STAR acts so differently.

I hope this additional data might give you details to help to explain this behaviour.

Thanks a lot!

Disclaimer: Data was processed with my own Galaxy instance, and I edited the commands for better readability.

ADD REPLY • link 11 months ago by casio ▴ 10

0

Entering edit mode

Please use hisat2 if you wish to use an alternate aligner for comparison with STAR. Tophat2 is an older aligner that is deprecated.

Note from the TopHat devs says:

Please note that TopHat has entered a low maintenance, low support stage as it is now largely superseded by HISAT2 which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads), in a more accurate and much more efficient way.

As noted by @dsull above you should check the alignments and see why reads are uniquely mapping in one instance and not other.

ADD REPLY • link 11 months ago by GenoMax 151k

0

Entering edit mode

Thank you for this advice.

Even if Tophat2 is an older aligner and is deprecated, I don't think this gives us an explanation why RNA STAR behaves in the way it does vs. why we don't see this effect with Tophat2.

We want to use RNA STAR in production, and Tophat2 is only referred to as an alternative to investigate if the drop of quality in the alignment comes from the data or the mapper.

ADD REPLY • link 11 months ago by casio ▴ 10

0

Entering edit mode

Since we can't see your data it is hard to provide a logical explanation. It could be due to one of the additional alignment options you are using with STAR. Can you remove those and try?

ADD REPLY • link 11 months ago by GenoMax 151k

0

Entering edit mode

RNA STAR parameters: Besides some other output parameters for the SAM/BAM files, the parameters given here are default values of RNA STAR. I confirmed this with the official documentation. https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

Hisat2:

I used now also hisat2. The following is the result of Trimming with fastp and no hard removal of basepairs as given above:

23460158 reads; of these:
  23460158 (100.00%) were paired; of these:
    933198 (3.98%) aligned concordantly 0 times
    20953931 (89.32%) aligned concordantly exactly 1 time
    1573029 (6.71%) aligned concordantly >1 times
    ----
    933198 pairs aligned concordantly 0 times; of these:
      37659 (4.04%) aligned discordantly 1 time
    ----
    895539 pairs aligned 0 times concordantly or discordantly; of these:
      1791078 mates make up the pairs; of these:
        1283064 (71.64%) aligned 0 times
        469095 (26.19%) aligned exactly 1 time
        38919 (2.17%) aligned >1 times
97.27% overall alignment rate

And this for Trimming with fastp and hard removal of 12 basepairs as given in detail above:

23435619 reads; of these:
  23435619 (100.00%) were paired; of these:
    11251629 (48.01%) aligned concordantly 0 times
    11322054 (48.31%) aligned concordantly exactly 1 time
    861936 (3.68%) aligned concordantly >1 times
    ----
    11251629 pairs aligned concordantly 0 times; of these:
      7037784 (62.55%) aligned discordantly 1 time
    ----
    4213845 pairs aligned 0 times concordantly or discordantly; of these:
      8427690 mates make up the pairs; of these:
        1208921 (14.34%) aligned 0 times
        5646578 (67.00%) aligned exactly 1 time
        1572191 (18.66%) aligned >1 times
97.42% overall alignment rate

The used hisat2 command in version 2.2.1

hisat2  -p 8  -x 'isat2_index/hg38/hg38'    -1 'input_f.fastq.gz' -2 'input_r.fastq.gz' --summary-file summary.txt

ADD REPLY • link 11 months ago by casio ▴ 10

0

Entering edit mode

Please use samtools to check a few reads that align with one STAR setting but not the other. Post a couple of those reads here, and I can check.

ADD REPLY • link 11 months ago by dsull ★ 7.6k

0

Entering edit mode

Hi,

thanks for the offer. I extracted all reads which do no align, and prepared a BAM file with it. https://drive.google.com/drive/folders/181QP93bskLka5WXnKQ1PlvFRykXORNnR?usp=sharing

The data this is based on, is trimmed with fastp with -f 1 and -F 1 and leads to a unique mapping of 65%.

Thanks you for your help.

ADD REPLY • link 11 months ago by casio ▴ 10

Ram · Answer 3 · 2024-07-07

0

Entering edit mode

11 months ago

Malcolm.Cook ★ 1.5k

I expect the issue is result of your unspecified trimming process. You probably didn't really need to trim. It sounds like you are not using options --clip3pNbases or --clip5pNbases. I think you should consider to do so if you are intent upon dialing in a fixed number of bases to ignore in the alignment.

ADD COMMENT • link updated 11 months ago by Ram 45k • written 11 months ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

Dear Malcolm,

please see my other comment. I added the exact commands I used for the trimming and alignment. In comparison to RNA STAR, TopHat2 does not have any issues with my trimming.

Even though I do not need trimming, I would like to know why even the trimming of one base pair, leads to a drop in alignment rate from 92% to 64%, trimming 12 base pairs leading to only 54% alignment rate. At the same time TopHat has 83% for non-trimmed and 85% for 12 base pair trimmed data.

Why is RNA STAR behaving like this?

Thanks a lot!

ADD REPLY • link 11 months ago by casio ▴ 10