Multiple Hits and Unique Mapped Reads
1
1
Entering edit mode
8.1 years ago
fusion.slope ▴ 250

Hi All,

am trying to use different softwares for my new sequencing protocol and am comparing 3 different alignment tools:

  • BWA
  • Bowtie2
  • Segemehl

Now am on the step in which i want to retrieve unique mapped reads and as in the community has been already explained different alignment tools have different solution. For example:

  • BWA uniquely mapped reads are retrieved selecting Quality Score > 0 and removing unmapped reads;
  • Bowtie2 uniquely mapped reads are retrieved removing all the reads with MAPQ < 1 as well as removing the unmapped reads and not primary aligned

For Segemehl i can't use these information since in the SAM format are not present the flags in wihich i can retrieve these information but i can align using the parameter multiple hits (-M). So can I use this parameter -M to 1 in order to extract unique mapped reads? Does Multiple Hits means that the reads are aligned in several regions of the Genome, so that -M > 2 means that the reads are mapped in different regions of the genome and then are not uniquely mapped?

Thanks in advance for any suggestions.

Cheers!

alignment sequencing Multiple Hits • 5.8k views
ADD COMMENT
1
Entering edit mode

Your title is wrong. Multiple hits, not hist :)
Hist looks like histogram

ADD REPLY
0
Entering edit mode

thanks chen i have corrected :))

ADD REPLY
1
Entering edit mode
8.1 years ago
fusion.slope ▴ 250

The NH:i:1 will allow to retrieve the uniquely mapped reads from Segemehl output.

doing:

grep "NH:i:1" myfile.sam > myfile.uniq.sam

will allow to retrieve uniquely mapped reads. I did not know before the NH:i information in SAM format.

Here at page 6 the explanation

http://epbi-radivot.cwru.edu/EPBI473/files/week7seqData/bak/SAMv1.pdf

NM:i Number of reported alignments that contains the query in the current record

ADD COMMENT
2
Entering edit mode

Just a couple of notes... grep "NH:i:1" will match also NH:i:10 or NH:i:11 etc.

I don't know Segemehl but NH can be greater than 1 and still you can have a "uniquely" mapped read. For example if one hit has zero mismatches and another hit has several mismatches than NH= 2 but the the second hit is so poor compared to the first one that you can say you have a unique mapping. In fact, the concept of "uniquely mapped" is a bit misleading as you can align any read anywhere you want, you just need to allow enough edits (mismatches), for this reason mapq is preferable.

ADD REPLY
0
Entering edit mode

thanks dariober, here the correct script then:

awk -F"\t" '$14 == "NH:i:1" { print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11"\t"$12"\t"$13"\t"$14"\t"$15}' segemehl_output.sam > segemehl_output.uniq.sam

thanks also for the second points about the mapq l

ADD REPLY
4
Entering edit mode

I don't like this solution very much. You rely on the NH tag to be in position 14 but the sam format doesn't enforce that. You can still use grep as grep -w 'NH:i:1', this should be fine. Or even better a proper parser for sam format like python pysam, but that requires a bit more coding.

ADD REPLY
0
Entering edit mode

ok thanks, you are right about the position 14 but for Segemehl it is in the 14th column the NH:i information in the SAM file. For a general purpose is not a good option i agree..

ADD REPLY
0
Entering edit mode

Is read counting on this extracted unique alignments same as not using countMultiMappingReads in featureCounts?

ADD REPLY

Login before adding your answer.

Traffic: 2557 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6