I downloaded the TFBS (transcription factor binding site) data from the Riken 4 database. My task is to extract the binding sites for the transcription factor MEF2 and then build a PWM (position weight matrix) for analysis. Since this is for my research, I'd like to make sure that I am doing this correctly.
1. To extract only the MEF2 transcription factor, which column should I be looking at?
Edit: The last column (column 9) seems to give me the information. For the MEF2 family of proteins, it's annotated as:
TF_binding_site_cage_181208 MEF2A,C,D-173792 ;ALIAS MEF2A,MEF2C,MEF2D ;L3_ID L3_chr7_-_150385881
To extract these rows I used the command:
awk -F"\t" '$9~/MEF2/' file > output
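For anyone who prefers to do the same filtering in a script, here is a rough Python equivalent of the awk command above. It is only a sketch: the function name `filter_tf_rows` is made up, the sample rows are invented for illustration (only the MEF2 annotation string is from my file; the other columns and the second row are placeholders), and I am assuming the file is tab-delimited with the annotation in column 9.

```python
import re

def filter_tf_rows(lines, pattern="MEF2", column=9):
    """Keep tab-delimited lines whose given column (1-based) matches pattern.

    Hypothetical helper mirroring: awk -F"\t" '$9~/MEF2/'
    """
    regex = re.compile(pattern)
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip short or malformed rows instead of raising an IndexError.
        if len(fields) >= column and regex.search(fields[column - 1]):
            kept.append(line)
    return kept

# Made-up rows for illustration; columns 1-8 are placeholders and the
# second annotation is invented, not taken from the real file.
sample = [
    "chr7\t.\t.\t100\t110\t.\t-\t.\t"
    "TF_binding_site_cage_181208 MEF2A,C,D-173792 ;ALIAS MEF2A,MEF2C,MEF2D "
    ";L3_ID L3_chr7_-_150385881\n",
    "chr1\t.\t.\t200\t210\t.\t+\t.\t"
    "TF_binding_site_cage_181208 SRF-1 ;ALIAS SRF ;L3_ID L3_chr1_+_1\n",
]

mef2_rows = filter_tf_rows(sample)
print(len(mef2_rows))
```

Like the awk version, this is a plain substring-style regex match, so it keeps any row whose column 9 contains "MEF2" anywhere.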
2. Now suppose I have all the rows for the MEF2 TF. Each row gives a start and end for the binding site. What software is usually used to perform the alignment so that I can calculate the frequency counts?
3. Relating to question 2, do I have to worry about the strand? I don't think so.
4. Are there any software packages or papers that cover this from start to end?
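To make questions 2 and 3 concrete, this is the counting step I have in mind once the site sequences have been fetched and are all the same length. It is a minimal sketch under my own assumptions, not an established pipeline: the function names are hypothetical, the sequences are made up, the background is taken as uniform (0.25 per base), and a pseudocount of 1 is an arbitrary choice. It also assumes that, if the strand does matter, minus-strand sites get reverse-complemented so that all sites share one orientation before counting.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def count_matrix(sites):
    """Per-position base counts from (sequence, strand) pairs of equal length."""
    # Put every site in the same orientation before counting.
    oriented = [
        reverse_complement(seq) if strand == "-" else seq.upper()
        for seq, strand in sites
    ]
    length = len(oriented[0])
    if any(len(seq) != length for seq in oriented):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * length for base in "ACGT"}
    for seq in oriented:
        for pos, base in enumerate(seq):
            counts[base][pos] += 1
    return counts

def log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn counts into log2-odds scores against a uniform background."""
    n_sites = sum(counts[base][0] for base in "ACGT")
    total = n_sites + 4 * pseudocount
    return {
        base: [math.log2((c + pseudocount) / total / background) for c in col]
        for base, col in counts.items()
    }

# Made-up sites for illustration only (not real MEF2 sites).
sites = [("CTAAAA", "+"), ("TTTTAG", "-"), ("CTATAA", "+")]
counts = count_matrix(sites)
pwm = log_odds(counts)
print(counts["A"])
```

The middle site is on the minus strand, so it is flipped to CTAAAA before counting; without that step its bases would be counted in the wrong columns.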
Disclaimer: I am a math grad student doing research in bioinformatics. So while I'll be okay once I have the PWM, it's the tools, software, and biological knowledge needed to get there that I'm unsure about.