Hello everyone!
I'm relatively new to bioinformatics, so please bear with me. A Ph.D. student in my lab has asked me to analyze ChIP-Seq data and determine whether peaks fall into exon, intron, or exon-intron junction categories. The gene she is looking at is TNFAIP3 and the mark she wants me to analyze is Pol II. I am using the hg19 human genome assembly.
Unsure how to start, I am thinking of doing the following:
- Download BED files for TNFAIP3 and get individual files for exons and introns.
- Annotate the peaks in the BED files using an existing Python script that calls HOMER.
After this I am unsure how to continue. Am I done once I have these annotations in a .txt format that can be opened in Excel? Her second question revolves around analyzing peaks for common sequence motifs, so how would I prepare this information to continue into that?
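For the first step in the plan above, if only an exon BED file is available, the introns can be derived as the gaps between consecutive exons. This is a minimal sketch, assuming a single transcript with non-overlapping exons in BED coordinates (0-based start, exclusive end). The function name `introns_from_exons` and the coordinates are made up for illustration; they are not the real TNFAIP3 exons.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons of ONE transcript.

    exons: list of (start, end) tuples in BED coordinates
    (0-based start, end exclusive). Assumes a single transcript.
    """
    ordered = sorted(exons)
    introns = []
    # Walk over neighbouring exon pairs; each gap between them is an intron.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:  # skip touching or overlapping exons
            introns.append((prev_end, next_start))
    return introns


# Made-up coordinates for illustration only.
example_exons = [(100, 200), (500, 650), (300, 400)]
example_introns = introns_from_exons(example_exons)
print(example_introns)
```

If several transcripts of the gene are annotated, the exons would need to be handled per transcript (or merged first), since this sketch does not deal with overlapping exons from different isoforms.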
What is the format of the ChIP-seq data you are talking about? Is it raw reads, mapped reads, or something else?
I'm not entirely sure how to answer your question, since I had no hand in the creation of the ChIP-Seq data. I have multiple file formats: for peaks I have a .fastq.gz file, but I also have a .narrowPeaks file (similar to BED format) for each individual protein she wants me to look at. With a bit of googling, I am under the impression that raw reads = FASTQ format and mapped reads = BED format.
The .narrowPeaks format is most likely a BED-format file with one or more extra columns indicating scores (heights, statistical significance) for each peak. Maybe from MACS2?
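To make the column layout concrete, here is a small sketch of reading one narrowPeak line in Python. It assumes the standard 10-column ENCODE narrowPeak layout (chrom, start, end, name, score, strand, signalValue, pValue, qValue, summit offset), which is what MACS2 normally writes. The `Peak` tuple, the function name, and the example line are made up for illustration.

```python
from collections import namedtuple

# Field names chosen for this sketch; the column ORDER follows narrowPeak.
Peak = namedtuple(
    "Peak",
    "chrom start end name score strand signal p_value q_value summit_offset",
)


def parse_narrowpeak_line(line):
    """Split one tab-separated narrowPeak line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected 10 tab-separated columns")
    return Peak(
        chrom=fields[0],
        start=int(fields[1]),          # 0-based start
        end=int(fields[2]),            # exclusive end
        name=fields[3],
        score=int(float(fields[4])),
        strand=fields[5],
        signal=float(fields[6]),       # enrichment signal
        p_value=float(fields[7]),      # -log10 p-value, -1 if absent
        q_value=float(fields[8]),      # -log10 q-value, -1 if absent
        summit_offset=int(fields[9]),  # summit position relative to start
    )


# Made-up example line, not real data.
example_line = "chr6\t1000\t1400\tpeak_1\t250\t.\t12.5\t30.1\t27.4\t180\n"
example_peak = parse_narrowpeak_line(example_line)
print(example_peak.chrom, example_peak.start, example_peak.end)
```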
https://github.com/taoliu/MACS/
Also, if you're looking at a single gene, it might be informative to look at the aligned ChIP-seq file on a genome browser.
You'll have to align your fastq file; then you can generate a .wig, .bedgraph, etc. file from your .bam file.
Yes, this is correct. I compiled a script that uses MACS2 to call peaks from a .sam file, and that's how I generated the .narrowPeak file. I have also already generated a .bigwig file so that I can view my data in IGV. Attached is the data that I'm currently looking at ... however, she wants a database that tells her where each peak is located (whether it is at an exon, intron, or exon-intron junction). But I do not know how to call these peaks and load them into a database for her.
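One possible way to build such a table is to compare each peak interval with the exon intervals of the gene. The sketch below is only one interpretation: it assumes a single transcript with non-overlapping exons, and it treats a peak as "exon" if it lies entirely inside one exon, "intron" if it touches no exon, and "exon-intron junction" if it crosses an exon boundary. The function name `classify_peak`, the category labels, and the coordinates are all made up for illustration.

```python
def classify_peak(peak_start, peak_end, exons):
    """Label one peak relative to the exons of ONE transcript.

    Coordinates are BED-style (0-based start, exclusive end).
    The category definitions are an assumption of this sketch.
    """
    ordered = sorted(exons)
    gene_start = ordered[0][0]
    gene_end = max(end for _, end in ordered)
    if peak_end <= gene_start or peak_start >= gene_end:
        return "outside gene"
    # Clip peaks that spill past the gene so the overhang is ignored.
    start = max(peak_start, gene_start)
    end = min(peak_end, gene_end)
    overlapping = [(s, e) for s, e in ordered if start < e and end > s]
    if not overlapping:
        return "intron"
    if any(s <= start and end <= e for s, e in overlapping):
        return "exon"
    return "exon-intron junction"


# Made-up exon and peak coordinates for illustration only.
example_exons = [(100, 200), (300, 400), (500, 650)]
example_peaks = [(120, 180), (220, 280), (180, 230), (700, 800)]
labels = [classify_peak(s, e, example_exons) for s, e in example_peaks]
for (s, e), label in zip(example_peaks, labels):
    print(s, e, label, sep="\t")
```

The tab-separated output can be saved to a text file and opened in Excel. Whether a peak should count as a junction based on the whole peak or only its summit is a decision to confirm with whoever requested the analysis.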
This is an image we generated for a previous paper. She wants to eventually look at these enrichment regions and see if there are any common sequence motifs, which I will probably run using MEME or DREME once I figure out how to actually use the program. The way she wrote down what she wanted me to do was this: "Generate the site-database of exons, exon/intron junctions or introns for each individual protein within intragenic enrichment". Maybe that will help.
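For the motif step, motif finders such as MEME or DREME take FASTA sequences as input, so one common preparation is to cut a fixed-size window around each peak summit and write it out as BED. This is a rough sketch, assuming the narrowPeak summit offset (column 10) is available; the 50 bp flank, the function names, and the example values are assumptions for illustration, not a recommended setting.

```python
def summit_window(chrom, start, summit_offset, flank=50):
    """Return a BED interval of +/- flank bp around the peak summit.

    start is the peak start; summit_offset is narrowPeak column 10.
    The default flank of 50 bp is an arbitrary choice for this sketch.
    """
    summit = start + summit_offset
    return chrom, max(0, summit - flank), summit + flank


def to_bed_line(chrom, start, end, name):
    """Format one interval as a 4-column, tab-separated BED line."""
    return f"{chrom}\t{start}\t{end}\t{name}"


# Made-up peak: starts at 1000 with its summit 180 bp downstream.
window = summit_window("chr6", 1000, 180)
bed_line = to_bed_line(*window, "peak_1")
print(bed_line)
```

The resulting BED file could then be passed, together with an hg19 FASTA file, to a tool such as `bedtools getfasta` to extract the sequences that go into MEME or DREME.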