Entering edit mode
8.0 years ago
User000
▴
710
Hello,
I have a genome with homologous pairs of chromosome. I am using hisat2 to map RNA-seq reads to genome. I want to do variant calling with GATK, but before I want to extract the reads that mapped uniquely from BAM file(not SAM). Could you suggest how to do that? From other threads seems that flag NH:i:1 means mapped once, but how can I grep it from bam file? and is it a right way?
HISEQ1:128:C0HUMACXX:6:2210:15441:67674 163 chr1A 1101543 60 59M = 1104438 626 CAATCTTCTCGTGGTTTATTGGCGAATGGAGTTTCAGAAGATGCTCAAAATGGTTTCCA @BCFFDDEFHDDFGEEEDHEHCBHFD<3D3C*?D:B?:B?B*98D?4?BB=CG)=FFHA AS:i:0 ZS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:59 YS:i:0 YT:Z:CP NH:i:1
HISEQ1:130:C0G55ACXX:1:1203:2254:30173 163 chr1A 1101543 60 101M = 1101599 155 CAATCTTCTCGTGGTTTATTGGCGAATGGAGTTTCAGAAGATGCTCAAAATGGTTTCCATGATGCTGATCGAACTGTCCGCAGAGGTGAGGGGCCATCTAA CCCFFFFFHHHHHJGIJJJJJJJJJJJJJGIJIJJIJIJJJJJJJJJJJJJJJIIJJJJJJJJJJJJHHHHHFFFFEEEDDDDDDD@BDDDDDDDDDDDDD AS:i:0 ZS:i:-15 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:-2 YT:Z:CP NH:i:1
thank you, may I pose the question in another way also:;since i have these homology i want to optimize parameters to map reads uniquely to the different subgenomes..so I either optimize parameters of mapping with hisat2 or filter the bam file...any suggestion is much appreciated.
There are many allele-specific alignment and processing packages already made, don't reinvent the wheel.
can you suggest me how to grep "NH:i:1" reads from bam? thanks
That's much slower and will produce the same results. Don't waste your time.
another question is: why with the command line above my new bam file is 4 times bigger than the previous one?
Because I didn't include the -b option, mea culpa.
I see, thank you very much! Why did you use -q 5? also, do you think it is ok to filter unique reads after doing mark dups step?
I don't understand why mapping quality >= 5 means uniquely mapped :(
It's an undocumented "feature" of hisat2/topat (and STAR, in case you ever need to use that) that MAPQ values of 0-3 are used for multimappers and a higher value (255 or 50) is used for non-multimappers.