After a few days of searching, I still haven't found a clear explanation of these three concepts and how to study them. I have a half-formed understanding of them, but I feel I'm missing a lot. After seeing this post, it seems the word "allele" can mean different things depending on context, which makes it harder for someone with a non-bio background to follow. Here is what I've figured out so far:
What data is required to look at allele-specific binding and allele-specific expression?
ChIP-seq is used to look at protein binding sites, so allele-specific binding analyses are done with ChIP-seq data. RNA-seq is used to study gene expression, so allele-specific expression analyses are done with RNA-seq data.
How to study allele-specific binding?
Suppose I have a reference genome and a ChIP-seq data set. After aligning the reads to the reference genome, I find all SNPs. As I understand it, studying allele-specific binding means looking only at the heterozygous SNPs. How are allele-specific binding sites identified using a list of heterozygous SNPs?
Allelic Imbalance
Nathan Sheffield posted a good explanation of allelic imbalance here.
I appreciate your response. That cleared some things up. I am using a dataset of high-confidence variant calls from the NIST Genome in a Bottle project as the source of my "known" heterozygous sites. I'm working with a small ChIP-seq dataset from here to start off with. To look for "enrichment of ChIPed alignments covering those sites", are you saying I need to count the number of aligned ChIPed reads that overlap a NIST heterozygous variant? This next step may be adding a layer of unnecessary complexity, but should I first identify binding sites from the data (using MACS, for example) and then count reads that overlap both a binding site and a NIST heterozygous variant?
Yes, you'll want to first call peaks with MACS or a similar tool and then only look at reads spanning heterozygous positions within those peaks. This will likely lead to a drastic decrease in the search space. You'll then count, at each variant, the number of spanning reads that carry each allele.
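To make the two steps concrete, here is a minimal Python sketch, not a definitive implementation. The function names `snps_in_peaks` and `binomial_two_sided` are hypothetical, and I'm assuming peaks are BED-style intervals (0-based, half-open), SNP positions are 1-based as in VCF, and that in the absence of allele-specific binding each allele is expected in half the reads. That last assumption ignores reference mapping bias, so treat the p-value as a rough first pass.

```python
from math import comb


def snps_in_peaks(snps, peaks):
    """Keep heterozygous SNPs that fall inside a called peak.

    snps:  iterable of (chrom, pos), 1-based positions (VCF convention).
    peaks: iterable of (chrom, start, end), 0-based half-open (BED convention).
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, pos in snps:
        # A 1-based position lies in a 0-based half-open interval
        # when start < pos <= end.
        if any(start < pos <= end for start, end in by_chrom.get(chrom, [])):
            kept.append((chrom, pos))
    return kept


def binomial_two_sided(ref_count, alt_count, p=0.5):
    """Exact two-sided binomial p-value for the observed allele counts.

    Assumption: under no allelic imbalance, a read carries the reference
    allele with probability p (0.5 by default).
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_count]
    # Sum the probability of every outcome no more likely than the observed one.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-7)))


# Made-up coordinates and counts, purely for illustration.
peaks = [("chr1", 100, 200)]
het_snps = [("chr1", 100), ("chr1", 101), ("chr1", 201), ("chr2", 150)]
print(snps_in_peaks(het_snps, peaks))
print(binomial_two_sided(10, 0))
```

With many SNPs tested you would also want a multiple-testing correction on the resulting p-values.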
Hi, can you suggest a software tool to "count the number of reads"? Thanks.
Maybe BCFtools mpileup? See its documentation here.
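If the pileup is written as VCF with per-allele depths included (I believe requesting the `AD` annotation in `bcftools mpileup` does this, but check the documentation for the exact option), the counts can be read straight from the `FORMAT/AD` field. Below is a rough Python sketch under that assumption; `allele_depths` is a hypothetical helper, and the example record is made up.

```python
def allele_depths(vcf_line, sample_index=0):
    """Return (chrom, pos, {allele: depth}) from one VCF data line.

    Assumes the line is tab-separated, has a FORMAT column, and that
    FORMAT includes an AD (allelic depth) entry for the chosen sample.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    keys = fields[8].split(":")
    values = fields[9 + sample_index].split(":")
    if "AD" not in keys:
        raise ValueError("no AD field; was it requested when the VCF was made?")
    raw = values[keys.index("AD")].split(",")
    depths = [0 if d == "." else int(d) for d in raw]
    # AD lists the reference allele first, then each ALT allele in order.
    alleles = [ref] + alt.split(",")
    return chrom, pos, dict(zip(alleles, depths))


# Made-up record in the shape of pileup-style VCF output.
line = "chr1\t101\t.\tA\tG,<*>\t0\t.\tDP=19\tPL:AD\t0,10,100,20,110,120:12,7,0"
print(allele_depths(line))
```

The reference and alternate depths at each heterozygous site inside a peak are then the two counts to compare.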