Entering edit mode
6.2 years ago
salvatore.raieli2
▴
90
Hi everyone,
I have a list of genes that are different expressed between tumor and normal tissue. I have two drugs that block two different transcriptional factor (let's say TF1 and TF2). What I want to do is to find which of these genes have in their promoter a TF binding site for TF1, for TF2 and for both.
For a similar task with a singular gene I did this:
library("TFBSTools")
library("JASPAR2018")
library("Biostrings")
#to retrieve the matrix for a specific tf in thre jaspar database
opts <- list()
opts[["species"]] <- 9606
opts[["name"]] <- "MYC"
PFMatrixList <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(PFMatrixList, pseudocounts=0.8) #generation of PWM matrix for MYC
seq1 <- read.delim(...) #the sequence of a gene with the promoter I downloaded from NCBI
subject <- DNAString(seq1) #making as DNA string
#finding the site in the sequence
siteset <- searchSeq(pwm, subject, seqname="seq1", min.score="60%", strand="*")
I have around 100 genes I think that this approach is impraticable.
Thank you in advance
Salvo
First of all, what you aim to find is a transcription factor motif. If a motif is also a binding site needs to be determined by experiment, because not every motif automatically means that a TF will bind there. For example, a motif in heterochromatin (and there a thousands of "unused" motifs for every factor across the genome) is unlikely to ba bound by a TF. You should have a look at FIMO from the
MEME suite
. It takes as input a fasta file with sequences, e.g. your promoter sequences, and a motif position frequency matrix, e.g. fromJASPAR
, and then scans the sequences for motif occurrence, outputting a GFF file.thank you, I will give a look to FIMO. I am not sure it is what I need. If I understand well I have to provide the FASTA for each gene I want to scan, right? Meaning that I have to manually download 100 genes and their promoters in the FASTA format. With FIMO can I automize this process?
You can pass a multifasta to FIMO, in the form:
Export the list of genomic coordinates to disk, and then use
bedtools getfasta
to make the multifasta. This can be fed into FIMO, which is fully automated from this point on.If I understood well, I get the genomic coordinates for my list of genes and with bedtools getfasta I obtain the fasta, I take all the fasta in one multifasta that I feed to FIMO, right?
If you feed a BED file with all the coordinates you have (your promoter regions), then bedtools will produce a multifasta file, each line representing one line of coordinates:
This multifa you feed into FIMO.
how I can generate a BED file just for the promoter regions of a list of genes?
By querying a database of promoter regions if available, and if not available, creating one.
For simplicity, you could take a window upstream of your genes, say 250bp.
I was thinking something similar if I am not finding a good database
I think you do not need a database. A promoter is always upstream of the first exon. When you do ATAC-seq (scan for open chromatin), a typical peak from a nucleosome-free region is hardly larger than 200bp. For motif scanning, in order to avoid matches by change, you anyway should limit the size of the regions you scan. I would go for e.g. 50bp downstream and 200bp upstream of the first exon, resulting in the 250bp, and then run the analysis. See if it works out, which I think it should.