I have the results of ChIP-Seq experiments for histone modification - aligned reads, and want to develop a script that will identify "true" peaks (I know about MACS, SICER etc. but I need to get familiar with the pre-processing of the data using a custom script). No control data are available to assess the noise.
Which characteristics would form a set of features that will allow for a simple procedure yet yielding good enough results?
I think the ratio: no. reads in the reverse strand / no. reads in the forward strand, and the no. reads within a window should be included.
To be quite frank, if you want to develop your own peak caller then you shouldn't be asking these questions. Read the literature. Look at what other tools are already doing, then try to come up with something sensible yourself.
As with most things in bioinformatics, people develop programs to address a specific problem or part of a problem, or to improve on a previous solution based on newfound information/knowledge/research.
You need to ask yourself why you want to develop a new program and who would be the audience, i.e., why would anyone use your program instead of others?
In ChIP-seq, the main parameters that are important are:
Quality control of raw reads, including adaptor trimming, eliminating reads below a certain length (e.g. 50bp), and ensuring that each base in each read is > 20 Phred score
performing unique mapping to a reference genome and setting the mapping quality (MAPQ) high (e.g. >50 Phred score) to prevent false alignments
removing duplicate reads that result from PCR amplification step
strand-specific alignments (depends on the NGS lab prep protocol and also the bioinformatic aligner and its parameters used)
read depth at each position / number of reads mapping at each position
maximum number of reads mapping to each position
the depth of coverage 'profile' and where to define a cut-off for signal / no signal, and for peak region identification (e.g. instruct to cease looking further for extending a region if the depth of coverage falls below 10x).
window size for moving along the genome and detecting peak regions
minimum/maximum peak starting size
maximum peak region allowed
normalization of depth of coverage and type of normalization used (e.g. Reads Per Kilobase per Million mapped reads [RPKM], reads per genome coverage [RPGC], etc.)
P- and Q-values, and fold-changes, for identifying where a region ends
Off the top of my head, current programs include HOMER, SICER, MACS, and bedCoverage.
Here's a pipeline developed by ENCODE for narrow TF peaks. However, you may need to modify it a little to work with broad peaks depending on what histone markers you're looking at.
To be quite frank, if you want to develop your own peak caller then you shouldn't be asking these questions. Read the literature. Look at what other tools are already doing, then try to come up with something sensible yourself.
I do read the literature but was looking for the extra input of experience. Thanks.