ChIP Sequencing For A Tiny Genome
10.8 years ago

I'll soon be sequencing my ChIP samples of point-source transcription factors (TFs) that I believe have an average-to-high number of binding sites throughout the genome compared with the "average TF". I am currently studying an organism, Oikopleura dioica, which has a very small genome for a chordate: 70 Mb.

The ENCODE guidelines for point-source TF ChIP sequencing recommend "a minimum of 20 million uniquely mapped reads" (Landt *et al.*, 2012) in mammalian cells, and a tenth of that for worms and flies, per factor (combining replicates). For the human genome that corresponds to a coverage of about 0.6× or 1.2×, and for worm 2× or 4×, depending on whether a 100 or 200 bp read length is assumed (they don't specify in the paper).
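As a quick sanity check, here is that conversion as a minimal Python sketch (the genome sizes are rough round numbers, and the two read lengths just cover both scenarios):

```python
# Read count -> fold coverage, using rough genome sizes
# (human ~3.2 Gb, C. elegans ~100 Mb).

def coverage(n_reads, read_length_bp, genome_size_bp):
    """Fold coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp

for read_length in (100, 200):
    human = coverage(20e6, read_length, 3.2e9)  # ENCODE mammalian minimum
    worm = coverage(2e6, read_length, 1.0e8)    # a tenth of that for worm/fly
    print(f"{read_length} bp reads: human {human:.1f}x, worm {worm:.1f}x")
```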

If I aim for, say, 8× coverage for the samples of my organism (70 Mb), I'd need to sequence 2.8 million 100 bp PE read pairs (or have that amount mappable, but let's simplify for now), a total output of 560 Mb.
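The same calculation in reverse (a minimal sketch, assuming every sequenced base maps):

```python
# Read pairs needed for a target coverage of a small genome.
genome_size_bp = 70e6      # Oikopleura dioica, ~70 Mb
target_coverage = 8
bases_per_pair = 2 * 100   # 100 bp paired-end

total_bases = target_coverage * genome_size_bp    # 560 Mb
read_pairs = total_bases / bases_per_pair         # 2.8 million pairs
print(f"{total_bases / 1e6:.0f} Mb total -> {read_pairs / 1e6:.1f}M read pairs")
```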


My samples are going to be run in a facility that has an Illumina HiSeq 2500. This instrument has an output capacity of around 150 million read pairs per lane, i.e. about 30 Gb with 200 cycles (100 bp PE). If I only need 2.8 million read pairs per sample, more than 50 such samples would fit on a single lane using multiplexing. I'll have only about 9 samples for now, and I know some other small-genome samples are being sequenced around the same time, but the machine is mostly used for mammalian samples.
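Roughly (ignoring barcode balancing and demultiplexing losses):

```python
# Samples per lane at the nominal per-lane yield mentioned above.
lane_read_pairs = 150e6
pairs_per_sample = 2.8e6   # from the 8x / 70 Mb calculation
print(f"~{lane_read_pairs / pairs_per_sample:.0f} samples per lane")  # ~54
```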

Concerning the sequencing depth for my samples: is my general reasoning correct? Am I right to make the transposition between organisms based on coverage? Concerning how to manage the sequencing: what's the ideal way of handling this? Should I sequence more of my samples to avoid waiting in a queue for other small samples?

Thank you.

chip-seq illumina • 3.6k views
10.8 years ago

Genome coverage is not the right way to estimate and extrapolate ChIP-seq sequencing requirements. The amount of data required depends very heavily on the number of bound locations and their occupancy.

The numbers quoted above come from averaging over a large number of different factors, but they also correspond to large, repetitive, less densely packed genomes. It would not be surprising if these numbers did not scale at all to very different genomes; you may need a lot more or a lot fewer reads.

The best way to approach this type of situation is to run a pilot study at fairly high coverage to catch even rare events, then evaluate the rates and coverages.
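For example (a sketch only, with hypothetical peak counts standing in for real peak calls), one could downsample the pilot library to a few depths, call peaks at each depth, and check whether peak discovery has plateaued:

```python
# Saturation check on a pilot ChIP-seq library: the peak counts below are
# hypothetical placeholders -- in practice they come from calling peaks
# (e.g. with MACS) on subsampled versions of the pilot data.

peaks_at_depth = {      # read pairs sampled -> peaks called (hypothetical)
    1_000_000: 3_100,
    2_000_000: 4_050,
    4_000_000: 4_400,
    8_000_000: 4_480,
}

depths = sorted(peaks_at_depth)
for lo, hi in zip(depths, depths[1:]):
    gain = (peaks_at_depth[hi] - peaks_at_depth[lo]) / peaks_at_depth[lo]
    print(f"{lo / 1e6:.0f}M -> {hi / 1e6:.0f}M read pairs: +{gain:.1%} peaks")

# If the last doubling of depth adds only a few percent more peaks, the
# library is close to saturation and extra sequencing buys little.
```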

10.8 years ago
Ian 6.1k

I once had a set of yeast (S. cerevisiae) samples with around 20 million reads per sample. It broke MACS (I think specifically the binomial calculation used to work out the optimum level of read redundancy). In the end I downsampled to 1.5 and 5 million reads. Yeast has ~12.1 million bases (12 Mb), so your calculation might be on the low side: at a 100 bp read length, 5 million reads on a 12 Mb genome is already roughly 40× coverage, well above your proposed 8×. I certainly second Istvan in that a pilot is needed. I would play it safe with 5-10 million reads (ChIP and input) and titrate down. Better to throw reads away than not have enough.


I keep my reference for S. cerevisiae at ~5 million reads.
