One of the projects in our lab aims to uncover genes whose expression is linked to SNP allele variants. For example, having a "T" instead of an "A" at position X in gene Y might be linked with a two-fold difference in gene expression.
For this, I have about one million 454 cDNA sequences (around 300 bp each), and I plan to do the following (most of it is already done):
- Assemble all sequences de novo into contigs (representing genes)
- Save consensus sequences
- Reassemble using the contigs as a reference
- Detect SNPs (export SNP table)
Now for the tricky part, for which I would appreciate your suggestions. I need to statistically test for allele-specific gene expression across the 16 individually tagged fish. For that, I will use only the fish that are heterozygous at the SNP in question.
The goal is to end up with a p-value per gene that tells us whether that gene shows SNP allele-specific expression differences.
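To make the target concrete, here is a minimal sketch of one naive baseline, not a recommended final method. It assumes that for a given SNP you already have, for each heterozygous fish, the number of reads carrying each allele. The function names `binom_two_sided_half` and `ase_test` and the input layout are hypothetical. It computes (1) an exact two-sided binomial test of the pooled read counts against a 1:1 ratio and (2) a sign test on whether the same allele is the more expressed one across fish.

```python
from math import comb


def binom_two_sided_half(k, n):
    """Exact two-sided binomial p-value for k successes in n trials, p = 0.5.

    Sums the probability of every outcome that is no more likely than the
    observed one. Integer arithmetic is used so large n cannot overflow.
    """
    weights = [comb(n, i) for i in range(n + 1)]
    extreme = sum(w for w in weights if w <= weights[k])
    return extreme / 2 ** n


def ase_test(counts):
    """counts: list of (reads_allele_1, reads_allele_2), one pair per
    heterozygous fish (hypothetical input layout).

    Returns (pooled_p, sign_p).
    """
    # Pooled test: are the summed reads compatible with a 1:1 allele ratio?
    a = sum(x for x, _ in counts)
    b = sum(y for _, y in counts)
    pooled_p = binom_two_sided_half(a, a + b)

    # Sign test: is the same allele favoured in most fish? Ties are dropped.
    up = sum(1 for x, y in counts if x > y)
    down = sum(1 for x, y in counts if x < y)
    sign_p = binom_two_sided_half(up, up + down)
    return pooled_p, sign_p


if __name__ == "__main__":
    # Made-up counts for six heterozygous fish, for illustration only.
    example = [(8, 2), (7, 3), (9, 1), (6, 4), (8, 2), (7, 3)]
    print(ase_test(example))
```

The pooled test treats every read as independent, so it ignores between-fish variation and any mapping bias toward the reference allele; the sign test is more conservative but discards the magnitude of the imbalance. Either p-value would still need multiple-testing correction across genes.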
(NOTE: see added biological information in comment below)
How would you proceed?
I added a bounty to this question. The accepted answer will give +100 reputation points to its author :)
Can you add some information about the genome architecture of your fish (ploidy, sex chromosomes, etc.)? If your fish is a haploid, low-recombination, gene-determined species, the answer will be pretty straightforward.
Here are some more details. The fish is pseudo-diploid, with an event of duplication about 50k to 100k years ago. Sex chromosomes are unknown in most fish species, including this one. The samples come from 2 backcross strains with one of the ancestors having undergone an artificial selection program. We have 8 individuals per strain. Cheers.
@Eric Normandeau - found a paper that might interest you (see edit in my answer)