Question

How to select uniquely and concordantly reads from hisat2 alingment for raw read count

0

Entering edit mode

7.3 years ago

Bioinfonext ▴ 470

I mapped pair end RNAseq read to genome using HISAT2 and got statistics like this:

31026735 (100.00%) were paired; of these:
    3579612 (11.54%) aligned concordantly 0 times
    25541051 (82.32%) aligned concordantly exactly 1 time
    1906072 (6.14%) aligned concordantly >1 times
    ----
    3579612 pairs aligned concordantly 0 times; of these:
      290787 (8.12%) aligned discordantly 1 time
    ----
    3288825 pairs aligned 0 times concordantly or discordantly; of these:
      6577650 mates make up the pairs; of these:
        6068493 (92.26%) aligned 0 times
        475539 (7.23%) aligned exactly 1 time
        33618 (0.51%) aligned >1 times
90.22% overall alignment rate

Now how I can select uniquely and concordantly reads for raw read count using samtools?

RNA-Seq • 7.2k views

ADD COMMENT • link updated 3.7 years ago by MCH ▴ 10 • written 7.3 years ago by Bioinfonext ▴ 470

score 5 · Answer 1 · 2018-01-26

5

Entering edit mode

7.3 years ago

Matteo Schiavinato ★ 3.7k

Ok, let's go through it. The -f 0x2 bitwise flag selection keeps proper pairs, which are read pairs mapped in correct orientation (I suppose illumina reads facing each other) and within insert size (-X and -I) -> concordantly.

Now, if you want the uniquely mapped, you should basically exclude all those reads that have a secondary alignment. From the HISAT2 manual:

ZS:i:    <N>     Alignment score for the best-scoring alignment found other than the alignment reported. Can be negative. Only present if the SAM record is for an aligned read and more than one alignment was found for the read. Note that, when the read is part of a concordantly-aligned pair, this score could be greater than [AS:i].

Basically, if the line contains the ZS: field, it has a secondary alignment, i.e. it is not unique. Filter those out, for example using awk:

samtools view -h -f 0x2 file.bam | awk 'substr($1, 0, 1)=="@" || $0 !~ /ZS:/' | samtools view -h -b > filtered_file.bam

The awk part selects only lines that either start with @ (like the header) or don't contain the ZS: tag, like the reads with secondary alignments (i.e. non-uniquely mapping).

ADD COMMENT • link 7.3 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

Thanks Macspider! I am also interested to use HTseq for read count which only count uniquely mapped read. Could anyone suggest how to proceed with hisat2 sam file to do read count using HTseq?

ADD REPLY • link 7.3 years ago by Bioinfonext ▴ 470

0

Entering edit mode

Documentation is your friend! https://htseq.readthedocs.io/en/release_0.9.1/tour.html

ADD REPLY • link 7.3 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

This is very wrong. You did not understand the meaning of the option ZS. It only reports the alignment score of the secondary alignment which can be less than the primary one. As long as the primary one has a higher score, that alignment is considered uniquely mapped. The only way is to comare AS and ZS together; if they are equal it means that the alignment is actually a multimapper.

ADD REPLY • link 3.7 years ago by MCH ▴ 10

score 0 · Answer 2 · 2019-05-15

0

Entering edit mode

6.0 years ago

al.bodrug ▴ 50

I think like this you're gonna end up with unpaired reads where one of the mates has a ZS tag and the other doesn't and is the only one kept. So they won't be concordant.

ADD COMMENT • link 6.0 years ago by al.bodrug ▴ 50