I have a list of 1900 genes of zebrafish. i would like to find out the which TFs bind to my list of genes of zebrafish. please send links
I have a list of 1900 genes of zebrafish. i would like to find out the which TFs bind to my list of genes of zebrafish. please send links
I am assuming you have tools like wget
and grep
etc. and that you have installed BEDOPS. I am also assuming you are working with the zebrafish build danRer7
, and that you have samtools
-indexed FASTA files for this genome downloaded locally.
To build a BED file of intervals for your genes:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/danRer7/database/refGene.txt.gz | gunzip -c > danRer7.refGene.txt
$ grep -f genes.txt danRer7.refGene.txt | awk -v OFS="\t" '{ print $3,$5,$6,$13,$9,$4 }' | sort-bed - > danRer7.genes.bed
To generate TSS regions, separate by strand and get the start position of the gene:
$ awk -v OFS="\t" '($6=="+"){ print $1, $2, ($2+1), $4, $5, $6; }' danRer7.genes.bed > danRer7.genes.start.for.bed
$ awk -v OFS="\t" '($6=="-"){ print $1, $3, ($3+1), $4, $5, $6; }' danRer7.genes.bed > danRer7.genes.start.rev.bed
Then apply a 5kb asymmetric padding around the start positions (you can pick a different window to define the regulatory region upstream of the TSS):
$ bedops --range -5000:0 --everything danRer7.genes.start.for.bed > danRer7.genes.start.for.TSS5k.bed
$ bedops --range 0:5000 --everything danRer7.genes.start.rev.bed > danRer7.genes.start.rev.TSS5k.bed
Take the union of the two files:
$ bedops --everything danRer7.genes.start.for.TSS5k.bed danRer7.genes.start.rev.TSS5k.bed > danRer7.genes.TSS5k.bed
Separate out the padded TSSs per gene:
$ awk '{ print $0 >> "danRer7.TSS5k."$4".bed"; }' danRer7.genes.TSS5k.bed
Generate FASTA for each gene using the bed2faidx script. Here's how you would do this for one gene, say "LOC567192":
$ bed2faidx --fastaDir=/path/to/danRer7/indexed/fasta < danRer7.TSS5k.LOC567192.bed > danRer7.TSS5k.LOC567192.fa
You have ~1900 genes in genes.txt
, so it is probably easier to use a loop:
$ while IFS= read -r gene; do echo "$gene"; bed2faidx --fastaDir=/path/to/danRer7/indexed/fasta < "danRer7.TSS5k.${gene}.bed" > "danRer7.TSS5k.${gene}.fa"; done < genes.txt
Once you have FASTA for each gene's padded TSSs, you can run MEME on the sequence information to get motif calls. You would then run TOMTOM to match what MEME discovers against entries in databases of published TF motifs.
The web interface for MEME and TOMTOM will only let you do one set of regions at a time. I would suggest you take one gene's sequence information and run it through these two tools to see what you get back, to get familiar with what these tools do.
Since you need to automate this, my advice is, once you know how MEME and TOMTOM work, you then download the source code and databases and build command-line versions of these tools that you can run locally. Then you know exactly what parameters go into your search. This is going beyond the scope of the answer here, but you can search Biostars and there should be answers that have already covered this topic.
1) Use BEDOPS bedops
to generate a BED file of regulatory regions of genes (like a 5kb region upstream of TSS, say).
2) Convert BED to FASTA with samtools
and bed2faidx
for your indexed reference genome.
3) Run the FASTA through MEME to find putative motifs.
4) Run the motif PWMs through TOMTOM to find highest ranking matches with known/published TF motifs (e.g., matches with JASPAR, TRANSFAC, UniPROT, etc.).
Search engines on the relevant terms will get you links to the tools mentioned here.
Could you send some tutorial to get bed file of regulatory genes. Can this method give regulatory sequence bed file for my list of genes at a time. Do I need to work out for each gene at a time. Please, guide me how to find regulatory region of gene with bedops because i am not familiar to this.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I have t search for every single gene, so it is not possible for my 1900 genes list