Hello NGS fellows,
I am a newbie here and would highly appreciate your advice about one particular experimental design.
We have data from an RNAseq experiment that was originally designed to assess differential expression. The details of the experiment are as follows:
2 modalities of the phenotype
Each phenotype is represented by 4 samples. 1 sample = 60 individuals pooled together at the stage of RNA isolation.
Molecule – polyadenylated mRNA
Sequencing chemistry – Illumina paired-end, read length 2 × 100 bp
My question is whether it is valid to use these RNAseq data to call SNPs. From an earlier search, I found that most people who call SNPs from RNAseq use 40-1000 samples (= individuals), but their RNAseq experiments were designed for GWAS from the start. I see that such an analysis cannot be applied to my data, if only because individual flies were pooled without barcoding (60 flies per sample). However, can I still call SNPs and submit the list to a database as potential targets for GWAS, together with, for example, an estimate of their functional impact on protein structure? Will they be “true” SNPs, or does our experimental design make even this step invalid?
I found this paper https://www.ncbi.nlm.nih.gov/pubmed/27458203 in which the authors used 2 phenotypes, each represented by 2 samples, which is close to our design, but I still have doubts.
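To make the pooling concern more concrete, here is a rough sketch of the kind of back-of-the-envelope check I have in mind. It assumes diploid autosomal loci (so 60 flies = 120 chromosomes), equal RNA contribution from every fly, and no allele-specific expression; none of these is guaranteed for pooled RNAseq, so the numbers are only an optimistic bound. The function name, the depths, and the threshold of 3 supporting reads are my own illustrative choices, not taken from any pipeline.

```python
from math import comb


def prob_at_least_k_alt_reads(allele_freq, depth, min_alt_reads):
    """Binomial probability of seeing at least `min_alt_reads` reads carrying
    the alternative allele among `depth` reads, when the allele has frequency
    `allele_freq` in the pool (idealised: no expression bias, no errors)."""
    p_fewer = sum(
        comb(depth, i) * allele_freq**i * (1 - allele_freq) ** (depth - i)
        for i in range(min_alt_reads)
    )
    return 1.0 - p_fewer


POOL_SIZE = 60                 # flies per pooled sample (from the design above)
CHROMOSOMES = 2 * POOL_SIZE    # assumption: diploid, autosomal locus
singleton_freq = 1 / CHROMOSOMES  # allele carried by one chromosome in the pool

# Hypothetical per-site coverages; real RNAseq depth varies with expression.
for depth in (30, 100, 500):
    p = prob_at_least_k_alt_reads(singleton_freq, depth, min_alt_reads=3)
    print(f"depth={depth:4d}  P(>=3 alt reads)={p:.3f}")
```

If this reasoning is roughly right, rare alleles in a pool of 60 would only be seen reliably in highly expressed genes, and I could at best estimate pooled allele frequencies, not individual genotypes.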
Here is a worked example of a GATK best-practices variant calling pipeline from this blog
Dear cpad0112, unfortunately the GATK pipeline does not say a word about how many samples, with or without pooling, etc., are needed to produce reliable SNPs. If I am wrong about this and was looking for the information in the wrong places, I would highly appreciate a link to the correct information about experimental design.
I suggest you consult a statistician nearby :).
Well, that is another problem, which led me here and to a couple of other forums :-/