I generated BAM files by aligning human RNA reads against the human reference genome using BWA MEM and TopHat2. I would like to know how I can retain only reads that mapped exclusively to a single site in the reference genome with one or zero mismatches for downstream analysis.
Thanks a lot for your help, Ivstan.
Just for understanding purposes, what is the difference between
samtools sort
andsamtools view -h chrom:start-end
? Can I intersect the BAMs with the chrom coordinate on already sorted BAMS?the BAM file needs to be sorted for the intersect to work, I forgot to mention that explicitly,
I assumed it was sorted because, as a rule, in general, all bam files should be always sorted by position.
Samtools tells me that:
Should I perform
samtools index -b input.bam
beforehand and store the index in the same directory as the input.bam?Would the extraction of reads that fall within a specified region (e. g. chr1) not also include multiple mapped reads?
some further filtering is necessary, look into what aligner you used and how that aligner marks the multimapped reads,
if it is bwa you can add
-q 0
(the quality is set to zero for multimapped reads) or the presence of XA tags.There is some variability in how aligners represent multimapped reads and that needs to be dealt with accordingly