Question

What are short reads in ChIP-seq and how come there are so many?

1

11.5 years ago

Affan ▴ 310

Hi all,

This may be a very basic question, but I seem to be having a lot of trouble wrapping my head around this. I've begun my research in bioinformatics (ChIP-seq, TF stuff) and am reading material to understand what ChIP-seq is before I look at the computational part of it. (For what it's worth, I am an Applied Math Master's candidate.)

My question can easily be explained with one picture. I am really just stuck at step 1.

Okay, my question is: how come there are so many overlapping reads? Suppose I have a DNA sequence and I shear it into fragments; how can they possibly overlap?

My "understanding" is that:

1) They take a DNA sequence and crosslink the protein of interest. Then they get rid of the DNA sequence surrounding this area of interest, so now we have a "small" sequence of DNA, and somewhere in this small DNA is where our TF binds. Now we make copies of this small DNA seq and run it through a sequencing machine. Is this correct?

2) They take a bunch of cells = bunch of DNA sequences. Then they do the same procedure above (by crosslinking and getting a "smaller" DNA sequence of interest). Since they had many cells to begin with, this means they had many DNA seq to begin with. Now they shear the DNA seq and we have fragments. Then we align these up with the reference genome.

Am I almost getting there in my understanding?

A secondary question: what is the significance of the red/blue alignments?

ChIP-Seq • 4.8k views

Updated 11.5 years ago by Istvan Albert 103k • written 11.5 years ago by Affan ▴ 310

Ram · Answer 1 · 2014-04-20

2

11.5 years ago

Devon Ryan 105k

You're quite close. Just remember that (1) and (2) are done together: you start with a LOT of cells in any case and then perform the cross-linking, fragmentation, purification, reverse cross-linking, and so on. Yes, there is often an additional amplification step prior to sequencing.

I would guess that the red/blue coloring is for read#1 and read#2 in a pair (or it's from a stranded/directional experiment and denotes the orientation of each read). In either case you would expect two peaks and that your protein(s) of interest bind somewhere between them.
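To make the "two peaks" expectation concrete, here is a minimal simulation sketch. It is only an illustration under stated assumptions, not the protocol of any particular experiment: every cell contributes one fragment that contains the binding site, shearing positions are random, and only one end of each fragment is sequenced. The function name, the fragment length range, and the coordinates are all made up for the example. Because each cell is sheared at different positions, the fragments overlap once aligned, and the 5' ends of forward-strand reads pile up to the left of the site while those of reverse-strand reads pile up to the right.

```python
import random
from statistics import mean


def simulate_five_prime_ends(n_cells=10000, site=5000,
                             min_len=150, max_len=300, seed=0):
    """Simulate one site-containing fragment per cell (hypothetical model).

    Returns two lists: 5' positions of forward-strand reads and
    5' positions of reverse-strand reads.
    """
    rng = random.Random(seed)
    forward_ends, reverse_ends = [], []
    for _ in range(n_cells):
        # Random shearing: fragment length and position differ per cell.
        length = rng.randint(min_len, max_len)
        # The fragment must contain the binding site to be pulled down.
        start = rng.randint(site - length + 1, site)
        end = start + length - 1
        # Assumption: single-end sequencing, either end chosen at random.
        if rng.random() < 0.5:
            forward_ends.append(start)   # read starts at the left border
        else:
            reverse_ends.append(end)     # read starts at the right border
    return forward_ends, reverse_ends


fwd, rev = simulate_five_prime_ends()
print("mean forward 5' end:", mean(fwd))
print("mean reverse 5' end:", mean(rev))
```

In this toy model the binding site always lies between the two pileups, which is the pattern described above.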

0

Technically the expectation is that the fragment represents the actual bound DNA.

It is true that the method/library preparation used to be so inaccurate that the fragment was often much larger than the footprint of the protein. But this keeps improving.

Updated 5.8 years ago by Ram 45k • written 11.5 years ago by Istvan Albert 103k

0

Yeah, some of the newer methods give very exact locations, but this is completely method dependent.

Written 11.5 years ago by Devon Ryan 105k

Ram · Answer 2 · 2014-04-20

One important detail to remember for all sequencing experiments is that sequencing always proceeds in the 5' to 3' direction. That is left-to-right on the forward strand and right-to-left on the reverse strand.

Moreover, depending on the length of the fragment and the length of the reads, you can end up with partially overlapping reads in various configurations. The reads can be shorter or longer than the fragment, and are almost never exactly the same length, so there will be all kinds of partial overlaps.

The configurations below can be observed when the fragment is more than twice as long as the read, less than twice as long but still longer than the read, and shorter than the read, respectively.

DNADNADNADNADNADNADNADNADNADNADNA
----------->         <-----------

DNADNADNADNADNADNA
----------->
       <----------

     DNADNADNADNA
     ---------------->
<----------------

ChIP-seq data is best visualized using only the 5' ends of the reads, which mark the fragment borders. The actual overlap and read coverage can be quite misleading.
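
As a rough sketch of what "just the 5' end" means in terms of alignment coordinates: for a read aligned to the forward strand, the 5' end is the leftmost aligned position, and for a read aligned to the reverse strand it is the rightmost. The helper names and the example alignments below are hypothetical, and the coordinates are assumed to be 1-based and inclusive; adapt this to whatever your aligner or file format reports.

```python
from collections import Counter


def five_prime_end(start, end, strand):
    """Return the reference position of a read's 5' end.

    start/end: leftmost and rightmost aligned positions (1-based, inclusive).
    strand: "+" for forward, "-" for reverse.
    """
    if strand == "+":
        return start
    if strand == "-":
        return end
    raise ValueError("strand must be '+' or '-'")


def five_prime_counts(alignments):
    """Count 5' ends per position, kept separate for each strand."""
    counts = {"+": Counter(), "-": Counter()}
    for start, end, strand in alignments:
        counts[strand][five_prime_end(start, end, strand)] += 1
    return counts


# Hypothetical alignments: (start, end, strand)
example = [(100, 149, "+"), (100, 149, "+"), (180, 229, "-"), (190, 239, "-")]
counts = five_prime_counts(example)
print(dict(counts["+"]), dict(counts["-"]))
```

Plotting these per-strand counts instead of full read coverage keeps the fragment borders visible, regardless of how long the reads happen to be.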