HI all,
I am processing RNA-seq data. Initially, the company used BWA and GRCh37/hg19 as an aligner. I subsequently realigned them using STAR and GRCh38/hg38. Counts are generated using feature counts for both.
Although most of the downstream analysis does not differ using either aligners, I realised that one particular gene in cancer immunology (PD-1) has decreased substantially.
read counts for both for comparison (sorry had to cut out the sample names)
> bwa["PDCD1",]
206 270 5 45 64 68 1486 215 177 59 98 26 70 374 19 112 207 176 267 114 128 207 132 126 150 74 63 105 65 83
> STAR["PDCD1",]
2 2 2 0 2 0 7 0 0 0 0 0 0 1 1 2 0 0 0 0 0 3 1 0 0 2 0 0 0 2
This is quite a popular and well studied genes so I thought I will ask if anyone had similar problems before. Any insight will be helpful as this is a key gene in my field and paper. The rest of the read counts that I've analysed so far have not changed significantly so I feel that there shouldn't be a problem in the alignment process. Anyone notice similar problem in other genes? Is this a biological problem?
Thank you.
Thank you for your comment. The BWA alignment was done by the sequencing company. I am not sure why they did it for RNAseq data, thats why I tried to realign them. Based on the comments below, it sounds like STAR can handle the adapters quite well. We've also performed trimming at the start to get rid of the adapters.
Can I ask that you share your knowledge a bit further with the difference between global alignment and local alignment? Would you recommend another aligner?
Sure. Global alignment (or in the case of read-mapping semi-global / fitting alignment) seeks to find the best alignment that matches _all_ of the read (query) against some substring of the genome (subject). That is, modulo adapter trimming — which can happen only at the ends of reads, semi-global alignment aims to align all of the read. Local alignment, on the other hand, seeks to find the best alignment between some substring of the read with some substring of the genome.
As an illustrative example — consider the following case:
read =
ACACGGTGTGCGTGTCATTCCATC
genome =
CCCATTTGACACGGTCACACGACATTTCCATCGGAAGGAAC
Here, there is no good semi-global alignment of the read against the genome. The best we might do is to align the read at position 8 with some extended CIGAR string that looks like
7M10X7M
--- which would invariably have a poor score. On the other hand, we could find a local alignment of the read against the reference at position 8 with extended CIGAR string7M
and, perhaps, a reasonable score.Which of these behaviors makes more sense depends on your task. For example, local alignment is what is used in BLAST, since we may want to find conserved domains in a larger query, and we may care about the fact that only some part of the query is a good match to the target. However, in read alignment, this is rarely the case, and we are most commonly interested in semi-global alignment, where we want to align all of the read (perhaps allowing clipping at the end-points where there may be adapter contamination).
I think the terminology, across tools, can be inconsistent. The way I like to present is the following. Take
Then an aligner may follow different alignment strategies where the reported alignment is quite different:
global alignment where both sequences are aligned fully. Basically there will be an alignment correspondence for every base of both sequences
reported as:
a semi-global alignment where the gaps at the end of one sequence have a zero cost:
reported as:
a semi-global alignment where gaps at end are zero and the mismatches at the end have low (close to zero) penalty vs mismatches inside the sequence:
reported as:
a local alignment where the highest-scoring substring is reported, which could be:
reported as:
As you see some alignment modes differentiate by the algorithm (global vs local) but within those, we can further differentiate by using different scoring for matches at the end vs matches inside of a sequence.
I believe that the default settings on
bowtie2
use case2.
above.bwa
by default runs as3.
Using the
--local
flag onbowtie2
will make it work more like case3
.