HI, all
I have a question when computing region coverage of pair-end RNASeq data.
As showed by the sketch map, when computing region coverage, whether I should use actually mapped reads or extended reads? The coverage for impacted region may be different. I wonder which one is reasonable.
I want to compute and compare coverage for small regions like 100 bp, not as large as a gene. Also, one of the compared sample is sequenced Pair-endly, the other is sequenced single-endly. Which way of computing is more suitable in this case?
I have a Rip-Seq and want to use this RNA-Seq as a control to call peaks. The genome would be binned into 100bp regions and coverage of each region will be computed for both Rip-Seq and RNA-Seq. A following fisher's test would be used to select significantly enriched regions of Rip-Seq.
The region is defined. "How many fragments were sequenced from this region?" is what I want to ask . If one of my defined regions happens to located in the internal region of two ends of a fragment, the coverage of it in RNA-Seq would be 0 if just mapped tags used. However, this region is surely covered by reads.
Thank you very much!
Reference ========--------------------------------------------------===========
Actually mapped reads ^^^^ $$$$
Extended reads ^^^^^^^ $$$$$$$$$$$
Impacted region **** *******
Reference ======================================================================
Actually mapped reads ^^^^ $$$$
Extended reads ^^^^^^^^^^^^^^^^$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$
Impacted region *************************************************************
#ignore the relative ratio of ^ and $
= represents exon bases; - represents intron bases
^ represents left-end reads; $ represents right-end reads
Update: Add my usage for this kind of data to make question clear.
Thans Ryan. Thanks for the nice explanation of ChIP-Seq peak calling process. My pair-end seq is input and single-end seq is IP result. The length of single-end reads is about 100 bp. The length of pair-end reads is 330 bp with 100 bp sequenced at each end. Is it ok to compare this two types of data?
Since your paired-end data is the input, you cannot use it to determine your protein's footprint. You'll need to use the peak-shift-size method on the single-end data if you can. Once you have an estimate of the protein's footprint size, probably the best approach is to assign each read a single base location at the estimated midpoint of the fragment that it represents. For example, if you estimate a footprint size of 150 bp, then you should take the 5-prime coordinate of each read and then shift it 75 bp (i.e. half of 150) in the 3-prime direction. Then whatever bin it lands in is the one you would count it in. But also keep splicing in mind. Since your reads are 100 bp long, you can just take the mapping coordinate of the 75th nucleotide in the read (for the example of a 150bp footprint size).
Thank you very much. I just confused about the shift part, even in analyzing ChIP-Seq data. I think the operation of shifting nucleotides based on the assumption that protein binds to middle of sequenced fragments. However, this can not always be true. So is there any explanation for the usage of shifting?
The assumption is not that the binding occurs in the middle of the fragments. It is that the binding occurs within the fragment. The most likely binding location is determined not from a single fragment, but from the superposition of multiple fragments.
The goal of the shifting is to find the midpoint of each sequenced fragment. The distribution of these midpoints should be centered on the binding site of the protein.
Some protocols have a nuclease step that digests any DNA that is not "protected" by being bound to a protein. In that case the estimated fragment size is actually representative of the protein's footprint. For example, micrococcal nuclease digests chromatin into individual nucleosomes, so the fragment length is the length of DNA wrapped around one histone. However, if your protocol does not include such a step (probably most likely for transcription factor ChIP), the fragment midpoints are still unbiased estimators of the binding site location, though the fragment size will be larger than the protein's footprint size.