Your sequence is too short to use with BLAT or BLAST, I think.
Here's one grep
-based approach you can parallelize on a per-chromosome basis.
This query ran in a second or two, for me. I think that's pretty fast.
We can look up NOTCH1 and we know it is found on chr9
.
For demonstration purposes, I'll download the FASTA sequence for that chromosome for genome hg38
, but the principle is the same for other chromosomes and genome assemblies:
$ wget -qO- https://hgdownload.cse.ucsc.edu/goldenpath/hg38/chromosomes/chr9.fa.gz | gunzip -c > chr9.fa
This will give you a list of all the 1-based byte offsets where your string of interest is found:
$ tail -n+2 chr9.fa | tr -d '\n' | grep -F -b -o "CACTGCC" | cut -d':' -f1 > forwardHitOffsets.txt
The -F
option is a fixed-string search (exact match). The -b -o
combination returns a 1-indexed byte offset. Because genomic sequence characters are single bytes, we can use those numbers directly as offsets or genomic positions.
Maybe your sequence is on the reverse strand, so you'd want to do a lookup for the reverse complement:
$ tail -n+2 chr9.fa | tr -d '\n' | grep -F -b -o "GGCAGTG" | cut -d':' -f1 > reverseHitOffsets.txt
To make a BED file from the forward-stranded hits:
$ awk -v SEQ="CACTGCC" -v OFS="\t" -v CHR="chr9" '{ print CHR, ($1 - 1), ($1 + length(SEQ) - 1); }' forwardHitOffsets.txt | sort-bed - > forwardHits.bed
For the reverse-stranded hits:
$ awk -v SEQ="GGCAGTG" -v OFS="\t" -v CHR="chr9" '{ print CHR, ($1 - 1), ($1 + length(SEQ) - 1); }' reverseHitOffsets.txt | sort-bed - > reverseHits.bed
To make one file:
$ bedops --everything forwardHits.bed reverseHits.bed > hits.bed
All the above steps can be worked into a parallelized pipeline, to eliminate intermediate files as well as to run queries in parallel, per-chromosome. I'll leave that as homework.
Once you have the file hits.bed
, you can do set operations to get all hits overlapping a specific gene, e.g., NOTCH1:
$ bedmap --echo --echo-map genes.bed hits.bed | grep NOTCH1 > answer.txt
The file genes.bed
would come from a BED file of refSeq genes from UCSC or GENCODE annotations, or other sources, where genes use HGNC symbol names.
That's a separate question, but one for which there are answers in other questions on biostars and Bioinformatics SE.
Nonetheless, to complete the demonstration:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/refGene.txt.gz \
| gunzip -c - \
| awk -v OFS="\t" '{ if (!match($13, /.*-[0-9]+/)) { print $3, $5, $6, $13, ".", $4; } }' - \
| sort-bed - \
> refGene.hg38.sorted.bed
You'd replace genes.bed
with refGene.hg38.sorted.bed
in this demo. This would give the following result:
$ bedmap --echo --echo-map refGene.hg38.sorted.bed hits.bed | grep NOTCH1
chr9 136494432 136546048 NOTCH1 . -|chr9 136495154 136495161;chr9 136495883 136495890;chr9 136500162 136500169;chr9 136500485 136500492;chr9 136500540 136500547;chr9 136501856 136501863;chr9 136503293 136503300;chr9 136506434 136506441;chr9 136506654 136506661;chr9 136507631 136507638;chr9 136508069 136508076;chr9 136510705 136510712;chr9 136515917 136515924;chr9 136517565 136517572;chr9 136521243 136521250;chr9 136521391 136521398;chr9 136523091 136523098;chr9 136525729 136525736;chr9 136528751 136528758;chr9 136530034 136530041;chr9 136532948 136532955;chr9 136536555 136536562;chr9 136538278 136538285;chr9 136540409 136540416
The answer is delimited in two pieces with a pipe symbol (|
).
The first piece is the NOTCH1 annotation. The second piece is a semi-colon delimited list of all the locations where your string (or its reverse complement) is found, which overlap NOTCH1 by one or more bases.
You can add overlap constraints to bedmap
. For instance, if you want to ensure that the entire query sequence is found inside the bounds of NOTCH1, add --fraction-map 1
:
$ bedmap --echo --echo-map --fraction-map 1 refGene.hg38.sorted.bed hits.bed > answer.txt
Read bedmap --help
for more options.
Amazing ! Thanks so much @Alex Reynolds !!