Question

How to deal with asterisk in bam file after alignment with STAR

0

2.6 years ago

brgs • 0

I have aligned paired-end RNA CLIP-seq data to the human genome with STAR, and the output BAM file contains some reads like this:

AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560   89  NC_000001.11    51781228    255 38M *   0   0   TTTCATGCGGGAAGGAAAGGATCAGTTGCCAAAAAGCC  <<//BBF<BBFFFFBFFFBFFFFBFBBFF<FFFFFF<F  NH:i:1  HI:i:1  AS:i:37 nM:i:0

I am wondering what the * means in this record. Most of the reads have = at that position, and when there is a = there are always two reads with the same head, but when there is a * there is only one read with that head.

(what I mean by head is this part: AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560)

I would also like to know how to filter out reads with * using samtools or other tools. Thanks a lot for your help.

samtools bam alignment STAR • 1.6k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 2.6 years ago by brgs • 0

score 2 · Accepted Answer · 2022-04-23

2

2.6 years ago

Istvan Albert 102k

I believe that the * sits in the RNEXT field (column 7: reference sequence name of the primary alignment of the NEXT read in the template) and indicates that this information is not available. Basically, it means that the other read in the pair is not mapped.

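You can filter these alignments on the SAM flag bit for an unmapped mate, for example with samtools view -F 8 to exclude them or samtools view -f 8 to keep only them: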

0x8     8  MUNMAP         next segment in the template unmapped
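To see why the example record matches this flag, here is a minimal Python sketch that decodes the FLAG column (89 in the record above) and checks the 0x8 bit. The function names are hypothetical, the bit table covers only the lower eight flags from the SAM specification, and SEQ/QUAL are replaced by * to keep the example short. It is an illustration of the flag logic, not a replacement for samtools.

```python
# Lower eight SAM flag bits, as defined in the SAM specification.
FLAG_BITS = {
    0x1: "PAIRED",
    0x2: "PROPER_PAIR",
    0x4: "UNMAP",
    0x8: "MUNMAP",
    0x10: "REVERSE",
    0x20: "MREVERSE",
    0x40: "READ1",
    0x80: "READ2",
}


def decode_flag(flag):
    """Return the names of all known bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mate_unmapped(sam_line):
    """True if the alignment record has the 0x8 (MUNMAP) bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    return bool(int(fields[1]) & 0x8)  # column 2 is FLAG


def keep_mate_mapped(lines):
    """Yield header lines and records whose mate is mapped (like -F 8)."""
    for line in lines:
        if line.startswith("@") or not mate_unmapped(line):
            yield line


# First nine columns of the record from the question; SEQ and QUAL are
# shortened to "*" here (an assumption made only for brevity).
example = "\t".join([
    "AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560",
    "89", "NC_000001.11", "51781228", "255", "38M", "*", "0", "0",
    "*", "*",
])

print(decode_flag(89))         # names of the bits set in flag 89
print(mate_unmapped(example))  # whether the record would be dropped by -F 8
```

Flag 89 is 1 + 8 + 16 + 64, so the read is paired, its mate is unmapped, it aligns to the reverse strand, and it is the first read of the pair, which agrees with the * in the RNEXT column.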

ADD COMMENT • link 2.6 years ago by Istvan Albert 102k

0

Thanks! It works!

ADD REPLY • link 2.6 years ago by brgs • 0

0

Sorry, one more question: why can this happen? I searched online but haven't found a good explanation of why one read of a paired-end pair can't be mapped to the genome.

ADD REPLY • link 2.6 years ago by brgs • 0

0

A simple and biologically relevant explanation would be a contaminant. Take any organism that shares some similarity with your reference: a fragment that originates in a similar region but ends in a dissimilar region will have a broken pair.

You could also have some fusions in the sequence. The fused sequences produce fragments that don't quite exist in the reference.

At the same time, you could also have other weird things happening, more on the measurement or sequencing error side, such as one read of the pair being degraded.

ADD REPLY • link 2.6 years ago by Istvan Albert 102k