Hi, I've performed an RNA immunoprecipitation and sequenced the results using NGS. We already have the reads mapped (S. cerevisiae), but we have a simple question that we are having difficulty answering: how many reads must be found for a specific transcript for that transcript to be considered present, or statistically significant? We ask because some genes have only a few reads while others have over 500, and a gene's 3 reads could easily be due to error, whereas 500 reads are much less likely to be. So we want to develop a statistical analysis to answer this question. Does anyone know if one already exists, or have any ideas as to how I should start?
To be clear, I'm dividing my analysis into two questions: does the transcript have enough coverage to be considered usable, and if it does, is it highly enriched?
Also, this experiment is not meant to determine where splice sites or SNPs are located; we really just need basic coverage to determine which RNAs are present.
Thanks, I appreciate any input
edit (6/3/10)
After the discussion below with Eric, it occurred to me that I can explain my question another way. I think two types of methods have been discussed here (see the second answer and comments): one where a strict threshold is set, i.e. only look at genes with at least 10 tags, and another where the threshold is based on some sort of statistic, which makes the threshold variable in terms of raw tags. In your opinion, is one better than the other, and should I use both or neither? The other point I'm getting at: if the statistical method shows that very few transcripts have enough tags above background, do I conclude that I need more sequencing depth or more replicates?
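To make the "variable threshold" idea concrete, here is a minimal sketch, not a method anyone in this thread proposed. It assumes background reads for a gene follow a Poisson distribution with an expected count `lam` (a hypothetical value you would have to estimate yourself, e.g. from a control), and finds the smallest raw tag count that would be unlikely under that background. The function names and the `alpha` cutoff are made up for illustration.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    # Sum the probabilities of 0..k-1 and subtract from 1.
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def min_significant_count(lam, alpha=0.01):
    """Smallest count k such that P(X >= k) < alpha under the assumed background.

    lam is an ASSUMED expected background count for one gene; alpha is an
    arbitrary illustrative cutoff, not a recommendation.
    """
    k = 0
    while poisson_sf(k, lam) >= alpha:
        k += 1
    return k

# The raw-tag threshold moves with the assumed background level.
for lam in (0.5, 5.0, 20.0):
    print(lam, min_significant_count(lam))
```

Under these assumptions a fixed cutoff such as "at least 10 tags" is too strict for genes with almost no background and too lenient for genes with a lot of it, which is the practical difference between the two approaches.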
another edit (6/4/10): Sorry, I failed to mention that we have a test sample in which our protein of interest is myc-tagged, so the RNA in the IP is pulled down with our protein of interest. We also have a control in which the strain has no myc tag encoded in the genome (wild-type), so we consistently pull down background RNA. So I need to compare my test to my control (background) and see whether a transcript has enough reads above background in the test (statistically speaking) to be considered present, in addition to being highly IP'd relative to other transcripts.
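As a rough illustration of what a test-versus-control comparison could look like for a single gene, here is a minimal sketch using a one-sided binomial test. It is an assumption on my part, not an established RIP-seq method: under the null hypothesis a read from the gene lands in the test library with probability equal to the test library's share of all mapped reads. The function names and the example counts are hypothetical, and with many genes the p-values would still need a multiple-testing correction.

```python
import math

def binom_sf(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large counts do not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(1.0, total)

def enrichment_pvalue(test_count, control_count, test_total, control_total):
    """One-sided p-value that a gene has more reads in the test IP than expected.

    test_total and control_total are the total mapped reads in each library
    (both assumed to be greater than zero) and are used to account for
    different sequencing depths.
    """
    n = test_count + control_count
    if n == 0:
        return 1.0  # no reads at all: nothing to test
    p0 = test_total / (test_total + control_total)
    return binom_sf(test_count, n, p0)

# Made-up counts: a gene with 3 vs 3 reads and a gene with 500 vs 20 reads,
# with equally deep libraries.
print(enrichment_pvalue(3, 3, 1_000_000, 1_000_000))
print(enrichment_pvalue(500, 20, 1_000_000, 1_000_000))
```

With no replicates this only captures sampling noise, not biological variability, so it should be read as a lower bound on the uncertainty.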
This is an interesting read; however, I got lost when all the equations came in (I'm trained as a molecular biologist, but I'm dabbling in bioinformatics). If you have a second, could you explain how you think I could implement their methods, keeping in mind I'm more of a wet lab type of guy? Thanks
I'm not suggesting you implement their methods! Just suggesting that they (who know about these sorts of stats) have created Cufflinks and described the model, its potential shortcomings, and example use for something very close to the problem you describe. So ... don't try to implement anything, just try to run TopHat and maybe Cufflinks on your reads.
The only caveat is that I don't think Cufflinks is meant for the case where you have a background control or input, since with Cufflinks you are essentially measuring input (total RNA) alone. Sorry, I didn't mention originally that I had a sequenced background control; that sorta changes things. I edited my original post.