Hello,
I have a task to predict TF-target gene pairs from binding sites. After doing some research, I have a general idea, and I would like to share it here to get your opinions and expertise. Note: I'm still a beginner, so apologies if I've made a lot of mistakes.
To predict a binding site, the first thing I need is both the transcription factor (TF) sequence and the target gene's promoter region.
To get the TF sequence, I think I need the cDNA sequence of its transcript. I use Ensembl, so I think I need the cDNA FASTA file, which contains ENST IDs. I concluded that I should use the transcript sequence because the protein that binds to the target gene is produced after alternative splicing, so the transcript sequence seems the most suitable.
To get the target gene's promoter region, I need the full human reference genome and a rough estimate of where the TF will bind in the non-coding region upstream of the first exon (before the coding region starts). I also need to consider every non-coding region that lies between the coding regions of a gene. So I think I will have several regions for a single target gene, and I will need to extract the sequence of each of them.
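To make the "region before the gene" step concrete, here is a minimal sketch of how such a window could be computed and extracted. Everything in it is an assumption for illustration: the helper names `upstream_window` and `extract_upstream` are hypothetical, coordinates are taken to be 1-based and inclusive, the strand is encoded as 1 or -1, and the default window of 1000 bp is arbitrary. The chromosome is a plain Python string here; a real genome would be read from a FASTA file.

```python
# Hypothetical helpers; coordinates are 1-based and inclusive.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def upstream_window(gene_start, gene_end, strand, upstream=1000):
    """Return (lo, hi) of the window directly upstream of the gene.

    For a forward-strand gene (strand == 1) "upstream" means lower
    coordinates than gene_start; for a reverse-strand gene (strand == -1)
    it means higher coordinates than gene_end.
    """
    if strand == 1:
        lo, hi = gene_start - upstream, gene_start - 1
    elif strand == -1:
        lo, hi = gene_end + 1, gene_end + upstream
    else:
        raise ValueError("strand must be 1 or -1")
    return max(lo, 1), hi


def extract_upstream(chrom_seq, gene_start, gene_end, strand, upstream=1000):
    """Extract the upstream window, written 5'->3' relative to the gene."""
    lo, hi = upstream_window(gene_start, gene_end, strand, upstream)
    hi = min(hi, len(chrom_seq))          # do not run past the chromosome end
    region = chrom_seq[lo - 1:hi]         # convert 1-based inclusive to a slice
    if strand == -1:
        # Reverse-strand gene: report the reverse complement of the window.
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Toy chromosome, not real data.
chrom = "AACCGGTTACGTAGCT"
print(extract_upstream(chrom, 9, 12, 1, upstream=4))
print(extract_upstream(chrom, 5, 8, -1, upstream=5))
```

The same pattern could be repeated for each non-coding region of a gene by passing different coordinates, which would give the several regions per target gene mentioned above.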
Once I have the transcript sequence and the candidate binding regions, I need to use an alignment algorithm to check whether binding is possible. I also need to check both directions, the forward strand and the reverse strand.
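As a rough illustration of what "checking both directions" can look like in code, the sketch below searches a region for a short DNA query on both strands. It is a simplification under stated assumptions: it uses exact substring matching instead of a real alignment algorithm, the function names are hypothetical, and positions are 0-based offsets on the forward strand. The reverse strand is never built explicitly; searching the forward strand for the reverse complement of the query is equivalent.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(region, query):
    """List (position, strand) for every exact occurrence of query.

    A "-" hit means the query matches the reverse strand; its position is
    still reported as a 0-based offset on the forward strand.
    """
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        pos = region.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = region.find(pattern, pos + 1)  # allow overlapping hits
    return hits


# Toy sequences only.
region = "TGCATACGT"
print(find_on_both_strands(region, "GCAT"))
print(find_on_both_strands(region, "TATG"))
```

Note that a query which equals its own reverse complement (for example `ACGT`) is reported on both strands at the same position.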
My questions are:
What do you think about my workflow?
I'm a bit confused about whether I need to handle reverse-strand alignment and the 5' to 3' direction.
Suppose I have this transcript sequence:
5' - ATCATGCGA - 3'
I have the DNA region on the forward strand (I think the FASTA sequences from Ensembl are given on the forward strand):
5' - TGCATACGT - 3'
Which means the reverse strand is:
3' - ACGTATGCA - 5'
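The strand notation above can be reproduced with a few lines of Python. This is only a sketch with hypothetical helper names: `complement` gives the reverse strand read base-by-base under the forward strand (3' to 5'), and `reverse_complement` gives the same reverse strand rewritten in the usual 5' to 3' direction.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, in the same left-to-right order as seq."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq):
    """Complement read in the opposite direction (reverse strand, 5'->3')."""
    return complement(seq)[::-1]


forward = "TGCATACGT"
print("forward 5'->3':", forward)
print("reverse 3'->5':", complement(forward))
print("reverse 5'->3':", reverse_complement(forward))
```

This particular example sequence happens to read the same left to right and right to left, so the last two lines print the same string. For a sequence without that property, such as the transcript example `ATCATGCGA`, the complement and the reverse complement differ.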
If I want to align the transcript sequence to the forward strand, do I need to reverse it so that 5' meets 3' and vice versa? The string comparison would look like this:
Forward strand : 5' - TGCATACGT - 3'
Transcript seq : 3' - AGCGTACTA - 5'
And for the reverse-strand comparison:
Reverse strand : 3' - ACGTATGCA - 5'
Transcript seq : 5' - ATCATGCGA - 3'
Thank you.