I need to do a whole genome alignment on a largish metazoan (~1.8 GBp), and the sensible approach seems to be to mask the repetitive alements before doing that so that the number of all versus all seed matches is reduced.
The Lastz whole genome alignment tool has a softmasking option where masked regions aren't used for finding seeds (but can be used when extending seeds to build alignments). That sounds perfect to me.
However, I can't figure out how to specify the softmasked regions. Does someone know how to do this?
The Lastz documentation at http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_mask says
Sequence Masking File
This file is used with the xmask and nmask actions in a sequence specifier. It consists of one interval per line, without sequence names. Lines beginning with a # are considered to be comments and are ignored, as are blank lines. Only the first two whitespace-delimited words in any line are interpreted as the interval; the rest of the line is ignored.
Each interval describes a region to be masked, and consists of
<start> <end>
Locations are one-based and inclusive on both ends (i.e., they use the origin-one, closed position numbering system). Note that the masking intervals are counted along the forward strand, even if we are only aligning to the reverse complement of the query specifier (i.e. for ‑‑strand=minus).
Here is an example. If the target sequence is hg18.chr1, this would mask the 5' UTRs from several genes. Note that the third column is neither required nor interpreted by LASTZ, and acts as a comment.
884484 884542 NM_015658
885830 885936 NM_198317
891740 891774 NM_032129
925217 925333 NM_021170
938742 938816 NM_005101
945366 945415 NM_198576
1016787 1016808 NM_001114103
1017234 1017346 NM_001114103
1041303 1041486 NM_001114103
My question is - if the Target and Query files (for alignment) have multiple sequences, then how do you specify the masked regions if there is only a <start><end> specification? (you'd also need a <sequenceid>
specifier).
Do you know if there is a way around this or a different tool that will do the same job with masked sequences?
Lastz author here, came across this thread as part of a search for something else. For the sake of completeness... here's the scoop on softmasking and lastz
If you do nothing else, any lowercase nucleotide in your sequence file will be considered as soft-masked.
Additionally, lastz can perform some operations on the sequence after it loads it from memory. One of these is to apply a file of intervals, changing all the indicated bases to lowercase.
As for the masking file not recognizing sequence names, this does come from the design history of originally working only with a single sequence. Recognizing three-column masking files was on my todo list for a while. But at this point, I'm not actively making changes other than bug fixes.
The simplest workaround, rather than concatenating the whole input sequence, would be to pipe the sequence through something that does the softmasking, and then pipe it into lastz (or save it in a temporary file).
Bob H
Thanks for the (potential) confirmation and suggestion for contacting the author. I'll do that