Downloading BLAT
To get BLAT source code:
$ mkdir /tmp/blat && cd /tmp/blat
$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc.zip
$ unzip blatSrc.zip
Patching (optional)
I decided to make blat
a static binary to avoid missing foo.so
shared library errors. Here's a patch you can use to modify the blat
makefile:
$ cat >> static-blat-makefile.patch
3c3
< L += -lm $(SOCKETLIB)
---
> L += -lm -ldl $(SOCKETLIB) -static-libgcc
10c10
< ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O $(MYLIBS) $L
---
> ${CC} ${COPT} ${CFLAGS} -o ${DESTDIR}${BINDIR}/blat $O -static $(MYLIBS) $L
You may need static library packages installed on your system. The names of these packages will depend on your version of Linux.
Then apply the patch:
$ cd /tmp/blat/blatSrc/blat
$ cp makefile makefile.original
$ patch makefile.original -i ../../static-blat-makefile.patch -o makefile
You may decide not to apply this patch. You could probably skip this step. I just don't like dynamic binaries.
Building BLAT
In any case, you will want to go to the top level of the blatSrc
directory and run make
to build the kit:
$ cd /tmp/blat/blatSrc && make
This will take a few minutes to build binaries.
Installing BLAT
To install them into ${HOME}/bin/${MACHTYPE}
, run:
$ make install
This destination is a subdirectory of your home directory.
Once it is built and installed, you can copy the binary to /usr/local/bin
or somewhere in your shell's PATH
that makes sense to you. For me, my ${MACHTYPE}
is x86_64
and I like having binaries in /usr/local/bin
:
$ sudo cp ~areynolds/bin/x86_64/blat /usr/local/bin/blat
Adjust this to the particulars of your setup.
Downloading genomes
Once you have installed blat
, the next step is to download a FASTA file for your genome of interest. If you wanted hg38
, for instance:
$ for chr in `seq 1 22` X Y; do echo $chr; wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr$chr.fa.gz | gunzip -c - >> hg38.fa; done
Optimizing queries
Once you have this file hg38.fa
, you can start doing queries, but it can help speed up searches if you first make an OOC file:
$ blat /path/to/hg38.fa /dev/null /dev/null -makeOoc=/path/to/hg38.fa.11.ooc -repMatch=1024
When you do searches, you'd pass this OOC file as an option to skip over regions with over-represented sequences.
Querying
Once you have this OOC file made, you can do searches with your FASTA file containing sequences of interest:
$ blat /path/to/hg38.fa /path/to/your-sequences.fa -ooc=/path/to/hg38.fa.11.ooc search-results.psl
The blat
binary will write any search results to a PSL-formatted text file called search-results.psl
. You can name this whatever you want.
The PSL format is described on the UCSC site.
Parallelization
If you have very many sequences, you can parallelize this by simply splitting up your input sequences file into smaller pieces, and running multiple blat
processes, one process for each piece of your original sequences file, writing many PSL files as output.
Set operations
It can help to use a tool like BEDOPS psl2bed
to convert PSL to a BED file to do set operations, but that depends on what you want to do with the results. In any case, to convert a PSL file to a sorted BED file:
$ psl2bed < search-results.psl > search-results.bed
How long are your query sequences? You could set up a local BLAT server and run things pretty quickly that way. No need to use Python.
200bp Wouldn't I need to download the entire human genome? I have the space, actually. I just have no idea how to go about it. That sounds ideal though.
If you insist in using Python for this you should have a look at the multiprocessing module for parallelizing the blast searches.