Very low read counts for PD1 when using STAR aligner compared to BWA
2
2
Entering edit mode
3.1 years ago
emi ▴ 80

HI all,

I am processing RNA-seq data. Initially, the company used BWA and GRCh37/hg19 as an aligner. I subsequently realigned them using STAR and GRCh38/hg38. Counts are generated using feature counts for both.

Although most of the downstream analysis does not differ using either aligners, I realised that one particular gene in cancer immunology (PD-1) has decreased substantially.

read counts for both for comparison (sorry had to cut out the sample names)

> bwa["PDCD1",] 
206        270          5         45         64         68       1486        215        177         59         98         26         70        374         19        112   207        176        267        114        128        207        132        126        150         74         63        105         65         83 
> STAR["PDCD1",]
2          2          2          0          2          0          7          0          0          0          0          0          0          1          1          2     0          0          0          0          0          3          1          0          0          2          0          0          0          2

This is quite a popular and well studied genes so I thought I will ask if anyone had similar problems before. Any insight will be helpful as this is a key gene in my field and paper. The rest of the read counts that I've analysed so far have not changed significantly so I feel that there shouldn't be a problem in the alignment process. Anyone notice similar problem in other genes? Is this a biological problem?

Thank you.

BWA RNAseq PD1 STAR • 2.1k views
ADD COMMENT
1
Entering edit mode
3.1 years ago
Rob 6.9k

It is also worth noting the following two differences:

  1. BWA is performing local alignment while STAR is performing global alignment (along with some optional degree of soft-clipping). These are not the same. Specifically, BWA will return alignments for substrings of the read that match the gene, even when there is no good alignment of the full read, end-to-end, to the gene. Now, the question of why there may be good local alignments but not good global ones is a relevant question. Perhaps there is adapter contamination? Perhaps there some sequence that is locally similar to the target gene but is not otherwise indexed? That's not clear, but one obvious cause for what you see is many reads that align well _locally_ to PD1, but have no good global alignment to that gene.

  2. BWA is not a spliced aligner, while STAR is. In general, I'd not recommend using an unspliced aligner like BWA for RNA-seq alignment against the genome. Possible you could use it to align agains the transcriptome, but I don't think that's what you're doing here as you mention using featureCounts. It's worth looking at some of the reads that BWA aligns to PD1 that STAR does not, but in general, spliced RNA-seq -> genome alignment is not a task for which BWA is designed.

ADD COMMENT
0
Entering edit mode

Thank you for your comment. The BWA alignment was done by the sequencing company. I am not sure why they did it for RNAseq data, thats why I tried to realign them. Based on the comments below, it sounds like STAR can handle the adapters quite well. We've also performed trimming at the start to get rid of the adapters.

Can I ask that you share your knowledge a bit further with the difference between global alignment and local alignment? Would you recommend another aligner?

ADD REPLY
1
Entering edit mode

Sure. Global alignment (or in the case of read-mapping semi-global / fitting alignment) seeks to find the best alignment that matches _all_ of the read (query) against some substring of the genome (subject). That is, modulo adapter trimming — which can happen only at the ends of reads, semi-global alignment aims to align all of the read. Local alignment, on the other hand, seeks to find the best alignment between some substring of the read with some substring of the genome.

As an illustrative example — consider the following case:

read = ACACGGTGTGCGTGTCATTCCATC

genome = CCCATTTGACACGGTCACACGACATTTCCATCGGAAGGAAC

Here, there is no good semi-global alignment of the read against the genome. The best we might do is to align the read at position 8 with some extended CIGAR string that looks like 7M10X7M --- which would invariably have a poor score. On the other hand, we could find a local alignment of the read against the reference at position 8 with extended CIGAR string 7M and, perhaps, a reasonable score.

Which of these behaviors makes more sense depends on your task. For example, local alignment is what is used in BLAST, since we may want to find conserved domains in a larger query, and we may care about the fact that only some part of the query is a good match to the target. However, in read alignment, this is rarely the case, and we are most commonly interested in semi-global alignment, where we want to align all of the read (perhaps allowing clipping at the end-points where there may be adapter contamination).

ADD REPLY
0
Entering edit mode

I think the terminology, across tools, can be inconsistent. The way I like to present is the following. Take

genome: ATGAAAATGC
read:   TAATAAT

Then an aligner may follow different alignment strategies where the reported alignment is quite different:

  1. global alignment where both sequences are aligned fully. Basically there will be an alignment correspondence for every base of both sequences

    ATGAAAATGC
    --.||.||.-
    --TAATAAT-
    

    reported as:

    2D1X2M1X2M1X1D
    
  2. a semi-global alignment where the gaps at the end of one sequence have a zero cost:

    ATGAAAATGC
      .||.||.
      TAATAAT
    

    reported as:

    1X2M1X2M1X
    
  3. a semi-global alignment where gaps at end are zero and the mismatches at the end have low (close to zero) penalty vs mismatches inside the sequence:

    ATGAAAATGC
       ||.||
       AATAA
    

    reported as:

     1S2M1X2M1S
    
  4. a local alignment where the highest-scoring substring is reported, which could be:

     ATGAAAATGC
        ||
        AA
    

    reported as:

      2M
    

As you see some alignment modes differentiate by the algorithm (global vs local) but within those, we can further differentiate by using different scoring for matches at the end vs matches inside of a sequence.

I believe that the default settings on bowtie2 use case 2. above. bwa by default runs as 3.

Using the --local flag on bowtie2 will make it work more like case 3.

ADD REPLY
0
Entering edit mode
3.1 years ago

I think you are going to have to look at alignments in IGV, and track down the individual reads that bwa mapped, and see what STAR did with them.

ADD COMMENT
0
Entering edit mode

to expand on this for the OP,

none of the short read aligners perform an optimal alignment, instead there is heuristics in place that will ignore alignments that seem far off being a good match.

depending on how the heuristics work and how soon a method "gives up", you could end up with radically different results.

in this case, it is likely that the reads do not align well, and have certain properties, for example, adapters (not sure if STAR can handle those) that make it unlikely to work in one case versus the other.

ADD REPLY
0
Entering edit mode

STAR is actually pretty good with removing adapters to make the alignment. I had a run of 10x where an enzyme was failing to cleave some adapter in the first 10 bases or so, and STAR was happy to soft trim the first 10 bases off and align the rest of the read properly. Trimming the fastq of the first 10 bases barely improved the alignment.

ADD REPLY
0
Entering edit mode

Thank you for your comment. If the reads have been contaminated with adapters, would we be seeing the effect across all genes? One of the things that make me scratch my head is that it seems 'targetted' and only a few genes are affect. My PI (not bioinformatician) has suggested whether alternative splicing for the gene is causing this. My impression is that splice aware aligner are designed to handle alternative splicing better than non-splice aware aligners, not the other way round.

ADD REPLY
0
Entering edit mode

Thank you for your comment. I have not used IGV before but happy to learn. Will doing this inform me how well aligned/not well aligned the reads are and figure whether BWA has included genes that are not well aligned and STAR has done the opposite? In either case, any other suggestions on whether trying another aligner might be helpful? Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6