I have aligned paired-end RNA CLIP-seq data to the human genome, and the output BAM file contains some reads like:
AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560 89 NC_000001.11 51781228 255 38M * 0 0 TTTCATGCGGGAAGGAAAGGATCAGTTGCCAAAAAGCC <<//BBF<BBFFFFBFFFBFFFFBFBBFF<FFFFFF<F NH:i:1 HI:i:1 AS:i:37 nM:i:0
I am wondering what the * means in this record. Most of the reads have = at that position, and when there is a = there are always two reads with the same head, but when there is a * there is only one read with that head.
(What I mean by head is this part: AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560)
I also want to know how to filter out reads with * using samtools or other tools. Thanks a lot for helping me out.
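For anyone landing here, a rough sketch of what is going on, based on the SAM specification rather than on anything specific to this dataset: the * sits in the seventh field (RNEXT, the mate's reference name), and FLAG 89 = 1 + 8 + 16 + 64 includes bit 0x8, "mate unmapped". So the read is paired, but its mate did not align, which is why only one record with that head appears. The Python below assumes tab-separated SAM text (for example from samtools view -h); the function names and the second example record are made up for illustration.

```python
# SAM FLAG bits, as defined in the SAM specification
PAIRED = 0x1         # read comes from a paired-end fragment
MATE_UNMAPPED = 0x8  # the other read of the pair did not align


def has_unmapped_mate(sam_line):
    """Return True if this alignment is paired but its mate is unmapped."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2 is FLAG
    return bool(flag & PAIRED) and bool(flag & MATE_UNMAPPED)


def drop_orphans(lines):
    """Yield header lines and alignments whose mate is mapped."""
    for line in lines:
        if line.startswith("@") or not has_unmapped_mate(line):
            yield line


# The record from the question, rebuilt with tab separators.
orphan = "\t".join([
    "AAAAAAAAAC:HWI-D00611:153:C6PBEANXX:5:1309:5483:77560", "89",
    "NC_000001.11", "51781228", "255", "38M", "*", "0", "0",
    "TTTCATGCGGGAAGGAAAGGATCAGTTGCCAAAAAGCC",
    "<<//BBF<BBFFFFBFFFBFFFFBFBBFF<FFFFFF<F",
    "NH:i:1", "HI:i:1", "AS:i:37", "nM:i:0",
])

# Hypothetical paired record (FLAG 99, RNEXT "=") for contrast; not real data.
paired = "\t".join([
    "read2", "99", "NC_000001.11", "100", "255", "4M", "=", "200", "104",
    "ACGT", "FFFF",
])

kept = list(drop_orphans([orphan, paired]))
print(len(kept), "record(s) kept")
```

With samtools itself, the equivalent filter should be something like samtools view -b -F 8 in.bam > out.bam, where -F excludes reads with the given flag bit set and in.bam / out.bam are placeholder file names. Using -f 2 instead keeps only reads the aligner flagged as properly paired, which is a stricter filter.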
Thanks! It works!
Sorry, one more question: why can this happen? I searched online but haven't found a good explanation of why one read of a pair can't be mapped to the genome.
A simple and biologically relevant explanation would be a contaminant. Take any organism that shares some similarity with your reference: a fragment that starts in a similar region but ends in a dissimilar one will have a broken pair.
You could also have some fusions in the sequence; the fused sequences produce fragments that don't quite exist in the reference.
At the same time, you could also have other weird things happening, more on the measurement or sequencing-error side, with one read of the pair being deteriorated.