A very short question: I have two BAM files from a ChIP-seq experiment. File A has 29 million reads and File B has 47 million reads. The problem arises when I count the tags from these two files in the genomic regions of interest, because one file has a higher number of reads than the other.
Is there a way to normalise these two files?
I know that normalisation can also be done after counting the tags in regions (commonly referred to as region-based normalisation).
You have it right. Normalization and analysis are done at the count level, not the BAM file level. There are a number of reasons for this, but the important one is that the actual counts, not just the relative counts, matter in most statistical approaches to ChIP-seq data. You could down-sample your larger BAM file, but that would be counterproductive because it throws away reads you have already sequenced.
Hi! So I should count the tags in each region and normalise like this: norm = (tags in region / length of region) / (sum of all tags present in all regions). Something like that?
You could try RPKM, which is similar to the equation you have given above, with the length of the region expressed in kilobases and the "sum of all the tags" replaced by "total aligned tags (in millions)". That is, RPKM = tags in region / (length of region in kb × total aligned tags in millions).
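As a rough illustration of the RPKM calculation described above, here is a minimal Python sketch. The function name `rpkm`, the 2,000 bp region and the tag counts (290 and 470) are made up for the example; only the library sizes (29 million and 47 million) come from the question. It also treats the total read count of each file as its total aligned tags, which is an assumption.

```python
def rpkm(tags_in_region, region_length_bp, total_aligned_tags):
    """Reads per kilobase of region per million aligned tags."""
    if region_length_bp <= 0 or total_aligned_tags <= 0:
        raise ValueError("region length and total aligned tags must be positive")
    length_kb = region_length_bp / 1_000            # region length in kilobases
    total_millions = total_aligned_tags / 1_000_000  # library size in millions
    return tags_in_region / length_kb / total_millions


# Hypothetical tag counts for one 2,000 bp region.
# Library sizes are the read totals given in the question.
rpkm_a = rpkm(290, 2_000, 29_000_000)  # File A
rpkm_b = rpkm(470, 2_000, 47_000_000)  # File B
print(rpkm_a, rpkm_b)
```

With these made-up numbers the raw counts differ (290 vs 470) only in proportion to the library sizes, so both files end up with the same RPKM for the region.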