I work with paired RNA-seq data from 2 conditions in 1 species.
I aligned those data on the reference genome using hisat2
. I am interested on retrieving reads concordantly mapped only once in the genome.
I filtered the SAM output using this post, but here is the problem : They say that ZS
field corresponds to multimapped read. Also in hisat2 documentation they say that field NH:i:
corresponds to number of locations where the read is mapped.
In my data, I observe pairs with the first one having no ZS
field and having NH:i:1
. And the second in pair having ZS
and NH:i:1
. How is it possible ? Then, when I use the filter based on 0x2
and ZS
I keep the first read of the pair, when I would like to keep both or not to keep them at all.
Exemple reads :
K00114:593:HHWHKBBXX:1:2212:10094:37941 99 NC_019867.2 30693395 60 2S77M88N21M = 30693461 120 GTCGACCAGAGGCGTACAGGGACAGCACGGCCTGGATGGCCACATACATGGCAGGAGAGTTGAAGGTTTCAAACATGATCTGTGTCATCTTCTCCCTGTT A<AFFJJFJFFAJJF7F-FFJFFFJJFJFJFFJJJJFJJJAJFJJJJJFJJJJJJJJAFFJJJFAF<JFJJJJJ7JA-AJFFFJAFJJJJJJJJJJFAAA AS:i:-2 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:98 YS:i:0 YT:Z:CP XS:A:- NH:i:1
K00114:593:HHWHKBBXX:1:2212:10094:37941 147 NC_019867.2 30693461 60 11M88N84M = 30693395 -120 TCAAACATGATCTGTGTCATCTTCTCCCTGTTGGCTTTGGGGTTGAGGGGAGCCTCGGTCAGCAGGACTGGATGCTCTTCAGGTGCGACTCTCAG FFJFJJ<FJFJJJFJAFJAJJJAJJAA-JJFJAA<-<-<JJJFJFJFFJJJAJJJJFJA<<A7FFF<AJJFJFJJJJJJJJJFJJJFAJJFAAAA AS:i:0 ZS:i:-16 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:95 YS:i:-2 YT:Z:CP XS:A:- NH:i:1
When I use the first post to parse my SAM am I really getting uniquely concordantly mapped reads ?
You could start filtering the file right after alignment, before you sort by coordinate, and then pipe this into samtools
fixmate
. Given you removed one, but not the other mate, fixmate will correct the FLAG and mark the remaining read as unpaired. When you then pipe this into samtoolsview
and discard unpaired reads, you'll get rid of filtering-induced singletons. The question is if this is worth the efford, depending on your downstream. What do you want to do with the data? Make a count matrix?In that case, I get rid of pairs mapping concordantly once with one read of the pair with multiple alignments ? If a pair has only one concordant mapping I would like to keep it even if one of the read maps somewhere else also.