If you want to automate this, you might try something like the following untested script.
Let's assume you have a sort-bed
-sorted BED file called geneTSSs.bed
that contains the coordinates of TSSs.
Let's also say you have a tabix-indexed bgzip file called hg19.gz
and the associated tabix index file hg19.gz.tbi
. The bgzip file is a compressed BED file containing four columns: chromosome, start, stop and an ID field containing the sequence data for that genome build, for that specified genomic range.
First, use bedops
to get the BED regions that lie +/-500 bases around each TSS.
Second, pass each of those regions to awk
to make a search variable called region
, which is passed to tabix
to search on hg19.gz
. Each search result is appended to the file answer.fa
:
$ bedops --range 500 --everything geneTSSs.bed \
| awk '{ \
region=$1":"$2"-"$3; \
system("echo \>" region " >> answer.fa"); \
system("tabix hg19.gz " region " | cut -f4 >> answer.fa"); \
}' -
There's a clean-up step I haven't shown. Depending on the k-mer size of the sequences in hg19.gz
, you may have some post-processing to do on each sequence in answer.fa
, trimming away sequence from the beginning and end.
For instance, let's say that you have 1000-mers in hg19.gz
. A sample region is chr2:1230-2650
. The tabix search returns sequence data from chr2:1000-3000
. So on the cleanup pass on answer.fa
, we need to cut the first 230 bases and the last 350 bases. I haven't shown code for that, but that should be easy to whip up in Perl or Python.
Ref: The BEDOPS suite offers bedops
and sort-bed
. The samtools suite offers bgzip
and tabix
.
Solid suggestion. I can definitely write a script to do it, I was just wondering if there was an interface that existed that could do it. I guess it was more of a theoretical question. Thanks though!