I am currently a first year grad student learning bioinformatics. I come from a purely mathematical background and I am learning biology and biological terms/functions as I go along (this can be challenging).
In a paper I am reading titled "A Hidden Markov Model to Identify Combinatorial Epigenetic Regulation Patterns for Estrogen Receptor α Target Genes" by Bonneville and Jin, I am a bit confused about what something they said. Let me give you some background: the paper talks about two cell lines (MCF7, MCF7-T) where the MCF7-T is the cell line that is resistant to Tamoxifen (a breast cancer medicine). They are using Hidden Markov Models to identify and analyze epigenetic patterns in these cell lines.
Question:
In section 2.1, they divide each chromosome in the entire genome (hg18) into 1000 bp bins. For each dataset, the unique matched reads were assigned to a bin according to their 5' end. They determined the mark status for each bin for each of the eight datasets as either 0 (non-mark) or 1 if for mark if the number of reads in the bin is sufficient such that P < 10^-4. They tell us to look at the supplementry data which is below:
+---------------------------+-----+-------------+-------------+
| Mark | ID | MCF7 | MCF7-T |
+---------------------------+-----+-------------+-------------+
| ERα/E2 | 1 | 8,852,289 | 4,445,403 |
| ERα/Control | 2 | 1,731,183 | 5,161,146 |
| PolII/E2 | 4 | 6,782,648 | 5,712,811 |
| PolII/Control | 8 | 6,439,650 | 7,434,175 |
| H3K4me2 | 16 | 2,417,878 | 6,650,379 |
| H3K27me3 | 32 | 49,900,845 | 53,481,772 |
| H3K9me2 | 64 | 4,916,106 | 5,228,383 |
| DNA methylation (MBD-seq) | 128 | 27,395,480 | 23,725,331 |
| Total | | 108,436,079 | 111,839,400 |
+---------------------------+-----+-------------+-------------+
My question(s) are related to the bolded parts.
1) What are unique mapped reads? If I understand correctly, Chip-Seq for Erα and ER2 will yeild the DNA sequence of where the transcription factor and the DNA is bound. Then these are split into "reads" which are short (often overlapping) sequences by a NGS machine. Does the table imply that for every region where ERα is bound, you take all the reads and add it together for ALL the regions and together they are ~8 million reads? Can someone help me understand this better?
2) What is the mark status? If I understand correctly, a Mark status is just saying what we Chip-Seq'ed? (note that you may not have to worry about ID as that comes later in the paper).
3) What are the 1000bp bins? If I understand correctly, there are ~8 million reads for Mark 1. Those reads (by sequence alignment) will fall under some bin.
Thanks, For number 3 just so I understand correctly: They took the genome from hg18 and separated into 1000bp bins. So if each dot represents some bp, we have .....|.......|.......| where the "|" separates them. Now, say for mark1, we look at the Chip-Seq data. Chip-Seq gives us "reads" which are just short fragements. We see where these fragments align with in the bins, and so thats why you say that most bins will be empty, correct?
Exactly. In fact I should add the the goal of Chip-seq is to have relatively few bins contain any reads, since you expect your "mark" to be primarily localized to a few regions (this is much less the case for MBD-seq, but that's a quite different thing all together)
Okay, so suppose I have a bin. |............|. Underneath this bin I will likely try to align my reads, I could potentially have a LOT of them. So how do I tell that these reads express the mark (like in your point #2)?
Just for clarification, the reads are mapped to the whole genome, not a bin at a time (perhaps you knew that, but it's better to clarify it at the outset). The bins are just post-hoc arbitrary partitions of the genome.
It's not that the reads express the mark, but rather that the presence of the mark in an area will increase the number of reads that will map there (if you're unfamiliar with how Chip-seq works, find someone local to draw it out for you on a whiteboard, that's probably the easiest way to understand it). The enrichment of reads mapping to an area is imperfect for a variety of regions. For this reason, one would normally run multiple replicates and negative controls (to gauge the background alignment rate) and use that together to declare the presence/absence of a mark in an area, which seems to not have been done in this paper in all cases (there are at least 2 controls).