Question

Finding TATAWAA in sequence

0

Entering edit mode

8.0 years ago

baski • 0

Hello,

I have human gene BMP1 form BMP family.
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000168487;r=8:22164736-22212326. I'm trying to find TATAWAA sequence near promotor -1000bp . I tried to get -1000bp from the gene and use blast to find TATAWAA sequence, but nothing working.

I'm sure it should be any TATAWAA sequence there(but maybe i'm wrong). Could anyone help me or lead me how to find it ?

Thank you guys.

gene fasta • 3.0k views

ADD COMMENT • link updated 8.0 years ago by shenwei356 8.7k • written 8.0 years ago by baski • 0

0

Entering edit mode

Maybe I miss something crucial, but if I understood correctly you are looking for exact matches of TATATAA or TATAAAA in a short sequence, what about just getting the piece of DNA and cntrl-F using your favorite text editor?

ADD REPLY • link 8.0 years ago by WouterDeCoster 47k

0

Entering edit mode

Yeah you are right. I do that so, but i thought i missed something. Thank you.

ADD REPLY • link 8.0 years ago by baski • 0

score 1 · Answer 1 · 2016-11-09

I wrote a command-line genome browser, ASCIIGenome, that may be handy in cases like this to find sequence motifs in a reference fasta file and show them together with an annotation file.

First load the reference sequence and an annotation file (this can be a remote file. Fasta file must be indexed):

ASCIIGenome -fa /path/to/genome.fa \
    ftp://ftp.ensembl.org/pub/release-86/gff3/homo_sapiens/Homo_sapiens.GRCh38.86.chromosome.8.gff3.gz

Then at the command prompt issue these commands:

find ENSG00000168487
grep -i \tgene\t.*ENSG00000168487 gff3
seqRegex -iupac TATAWAA
zo 8
print seqRegex
seqRegex > matches.bed
save matches.png

Explained: Find the gene ENSG00000168487, for clarity only show the "gene" feature (grep ...). Then search the motif TATAWAA as iupac code, zoom out x times (e.g. 8 times) to see some matches in the sequence.

The matches can be show on screen (print seqRegex) and saved to file (print seqRegex). Finally save a picture as png.

The resulting image should like similar to this:

enter image description here

score 1 · Answer 2 · 2016-11-09

Just use the cross-platform and ultrafast FASTA/Q tookit SeqKit

Geting GTF (time: 5s, memory: <1M)

$ zcat Homo_sapiens.GRCh38.84.gtf.gz | grep BMP1 | awk '$3=="gene"' > BMP1.gtf

Three records extracted:

$ cat BMP1.gtf 
2   ensembl_havana  gene    68865481    68871517    .   -   .   gene_id "ENSG00000163217"; gene_version "1"; gene_name "BMP10"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000129573"; havana_gene_version "1";
X   ensembl_havana  gene    50910784    50916607    .   +   .   gene_id "ENSG00000130385"; gene_version "5"; gene_name "BMP15"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000021525"; havana_gene_version "2";
8   ensembl_havana  gene    22164736    22212326    .   +   .   gene_id "ENSG00000168487"; gene_version "17"; gene_name "BMP1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTHUMG00000097761"; havana_gene_version "6";

Extracting upstram sequence (FASTA index pre-built. time: 2.6s, memory: 983M)
```
$ seqkit subseq --only-flank --up-stream 1000 --gtf BMP1.gtf hsa.fa  > BMP1.usf_1k.fa
```
Gene locations were stored in sequence header. For exmaple, 2_68865481-68871517:-_usf:1000 means 1000bp upstream flanking sequence of region 68865481-68871517 in negative strand of chromosome 2.
```
$ seqkit seq --name BMP1.usf_1k.fa 
2_68865481-68871517:-_usf:1000 ENSG00000163217
X_50910784-50916607:+_usf:1000 ENSG00000130385
8_22164736-22212326:+_usf:1000 ENSG00000168487
```

Locating pattern (time: 0.1s, memory: 0)

$ seqkit locate --ignore-case --degenerate --pattern TATAWAA BMP1.usf_1k.fa  | column -t 
seqID                           patternName  pattern  strand  start  end  matched
2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  +       578    584  TATATAA
2_68865481-68871517:-_usf:1000  TATAWAA      TATAWAA  -       577    583  TATATAA