I am trying to build a mappability mask with Heng Li's SNPable program https://lh3lh3.users.sourceforge.net/snpable.shtml . The instructions there are pretty brief and do not detail all the necessary steps, which makes it challenging for introductory bioinformaticians. Suppose the reference genome is genome.fa
, copy-pasting the given instructions, they are:
Extract all overlapping k-mer subsequences as read sequences:
splitfa genome.fa 35 | split -l 20000000
Align all reads to the genome with BWA
bwa aln -R 1000000 -O 3 -E 3 genome.index xxaa > xxaa.sai
Suppose the unsorted BWA alignment results are xx??.sam.gz. Generate rawMask with:
gzip -dc xx??.sam.gz | gen_raw_mask.pl > rawMask_35.fa
Generate the final mask:
gen_mask -l 35 -r 0.5 rawMask_35.fa > mask_35_50.fa
In step 2, genome.index
comes from bwa index
. I believe there are a few undetailed steps to do between steps 2 and 3. When I did step 1, it wrote a series of files xaa xab xac
... xlk xlm
, then step 2 aligned these and wrote the files xaa.sai xab.sai xac.sai
... xlk.sai xlm.sai
. I used bwa samse
to generate sam files. For each sai
file do:
bwa samse genome.index ${prefix}.sai > ${prefix}.sam
I then merged all of these sam files into one big sam file:
samtools merge merged.sam ${allsamfiles}
and then processed steps 3 and 4 with that merged sam file:
gzip -dc merged.sam | gen_raw_mask.pl > rawMask_35.fa
gen_mask -l 35 -r 0.5 rawMask_35.fa > mask_35_50.fa
However, the problems are that this final mask_35_50.fa
is not contiguous, in other words there are lots of different sequences for each chromosome, e.g.
cat mask_35_50.fa | grep ">chr4" | grep -v random | grep -v Un | wc -l # remove scaffolds and contigs with grep -v random and grep -v Un
returns 166, not 1 as I would have hoped.
What is going on here? It seems like I need to sort the merged.sam
file, but i tried this in various ways and things got even worse. Using
samtools sort -@ 4 -m 2G merged.sam -o merged_sorted.sam
then doing steps 3 and 4 with merged_sorted.sam , the final mask was even more fragmented. I also tried sorting by readname
samtools sort -@ 4 -m 2G -n merged.sam -o merged_sortedreadname.sam
but that also did not work.
Does anybody know what I am doing wrong?