Hi all,
This may be a very basic question, but I seem to be having a lot of trouble wrapping my head around this. I've begun my research in bioinformatics (ChIP-seq, TF stuff) and am reading material to understand what ChIP-seq is before I look at the computational part of it. (For what it's worth, I am an Applied Math Master's candidate.)
My question can easily be explained with one picture. I am really just stuck at step 1.
Okay, my question is: how come there are so many overlapping reads? Suppose I have a DNA sequence and I shear it into fragments. How can those fragments possibly overlap?
My "understanding" is that:
1) They take a DNA sequence and crosslink the protein of interest to it. Then they get rid of the DNA surrounding this area of interest, so now we have a "small" sequence of DNA, and somewhere in this small DNA is where our TF binds. Now we make copies of this small DNA sequence and run it through a sequencing machine. Is this correct?
2) They take a bunch of cells = a bunch of DNA sequences. Then they do the same procedure as above (crosslinking and getting a "smaller" DNA sequence of interest). Since they had many cells to begin with, they had many DNA sequences to begin with. Now they shear the DNA sequences and we have fragments. Then we align these to the reference genome.
Am I almost getting there in my understanding?
A secondary question: what is the significance of the red vs. blue alignments?
Technically the expectation is that the fragment represents the actual bound DNA.
It is true that the method/library preparation used to be so inaccurate that the fragment was often much larger than the footprint of the protein. But this keeps improving.
Yeah, some of the newer methods give very exact locations, but this is completely method dependent.
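On the original "how can sheared fragments overlap" question, a toy simulation may help. This is only a sketch under simplifying assumptions, not a model of any real protocol: every cell carries the same genome, each copy is cut at different random positions, and an idealized pull-down keeps only the fragment that contains the binding site. The function names (`shear`, `pull_down`, `simulate`) and all the numbers are made up for illustration.

```python
import random

def shear(genome_length, mean_fragment_length, rng):
    """Cut one genome copy at random positions; return (start, end) fragments."""
    n_cuts = genome_length // mean_fragment_length
    cuts = sorted(rng.sample(range(1, genome_length), n_cuts))
    bounds = [0] + cuts + [genome_length]
    return list(zip(bounds[:-1], bounds[1:]))

def pull_down(fragments, site):
    """Idealized immunoprecipitation: keep only the fragment containing the site."""
    return [(start, end) for start, end in fragments if start <= site < end]

def simulate(genome_length=2000, site=1000, n_cells=200,
             mean_fragment_length=200, seed=0):
    """Shear one genome copy per cell and pile up the pulled-down fragments."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_cells):
        fragments = shear(genome_length, mean_fragment_length, rng)
        kept.extend(pull_down(fragments, site))
    # Per-base coverage: how many kept fragments span each position.
    coverage = [0] * genome_length
    for start, end in kept:
        for pos in range(start, end):
            coverage[pos] += 1
    return kept, coverage

kept, coverage = simulate()
print("fragments kept:", len(kept))
print("distinct fragments:", len(set(kept)))
print("coverage at binding site:", coverage[1000])
print("coverage 400 bp away:", coverage[1400])
```

Within a single genome copy the fragments never overlap. The overlap only appears when fragments from many cells are aligned to the same reference: each copy was cut in different places, but every kept fragment spans the binding site, so coverage peaks there. The kept fragments are also much longer than a single-base "footprint", which is the imprecision mentioned in the replies above.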