Hi all,
I am very new to this kind of work so I would appreciate some opinions on the early stages of a SNP/Indel discovery workflow.
- Formulate list of genes of interest.
- Design custom capture for all exons from these genes.
- Sequence (Illumina, Paired End, 100bp reads, 50x coverage minimum)
- Remove duplicate paired end reads.
- Align to genome (Bfast to allow gaps and find indels as well as SNPs)
- Use SAMtools mpileup with varFilter (from vcfutils.pl) to remove SNPs with a quality score less than 20 or indels with a score less than 50.
- Remove SNPs with low coverage (less than 30x?)
- Proceed to association type study (details to be worried about later - my major concern is the upstream, NGS stuff at the moment as I have never done it before).
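As a rough illustration of the two filtering steps above (quality, then coverage), here is a minimal Python sketch. The `Variant` record and its field names are hypothetical and are not the output format of SAMtools or any other tool. The thresholds are the ones proposed in the list, and applying the 30x depth cut-off to indels as well as SNPs is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified variant record (not a real tool's format)."""
    chrom: str
    pos: int
    kind: str    # "snp" or "indel"
    qual: float  # variant quality score
    depth: int   # read depth (coverage) at the site


# Thresholds taken from the workflow above.
MIN_QUAL = {"snp": 20, "indel": 50}
MIN_DEPTH = 30  # assumption: applied to indels as well as SNPs


def passes_filters(v, min_qual=MIN_QUAL, min_depth=MIN_DEPTH):
    """Keep a variant only if it meets both the quality and depth cut-offs."""
    return v.qual >= min_qual[v.kind] and v.depth >= min_depth


def filter_variants(variants):
    """Return the variants that survive both filters, in input order."""
    return [v for v in variants if passes_filters(v)]


calls = [
    Variant("chr1", 100, "snp", 35.0, 60),    # kept
    Variant("chr1", 200, "snp", 15.0, 80),    # dropped: quality below 20
    Variant("chr2", 300, "indel", 45.0, 70),  # dropped: indel quality below 50
    Variant("chr2", 400, "indel", 90.0, 20),  # dropped: depth below 30
]
kept = filter_variants(calls)
```

In practice the same logic would be applied to whatever the variant caller actually emits. The point is only that the two cut-offs are independent and a variant has to pass both.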
All comments welcome. Thanks in advance!
Hi Travis. This all looks well thought out and solid to me. I don't know what I would add other than discussing assembly details and such. Do you have more specific questions about any part of this workflow?
In case you are working with tumors, I would add that you should also sequence the matched germline DNA, so you can subtract the germline SNPs and distinguish the somatic events.
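The subtraction step can be pictured as a set difference. The sketch below is only an illustration: representing each variant as a `(chrom, pos, ref, alt)` tuple is an assumption made here, and real somatic callers do considerably more than this.

```python
def somatic_candidates(tumor_calls, germline_calls):
    """Return tumor variants that are absent from the matched germline sample.

    Each variant is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    germline = set(germline_calls)
    return [v for v in tumor_calls if v not in germline]


tumor = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 900, "G", "A"),
]
germline = [
    ("chr1", 100, "A", "G"),  # inherited variant, also seen in the tumor
    ("chr3", 50, "T", "C"),
]
candidates = somatic_candidates(tumor, germline)
```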
Thanks a lot for the responses guys!
Tony: This study won't be cancer-related. Perhaps a good thing as it sounds like that complicates things.
Eric: More specifics are welcome. I am a complete newbie - I have never run any of these tools before!
Minor detail: it's easier to remove duplicates after aligning.
Come to think of it, I don't even know how to remove duplicate pairs yet! Any pointers? :)
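One way to think about it: once the reads are aligned, pairs whose two mates map to the same positions and strands are treated as duplicates of one another, and only one of them is kept. Tools such as Picard MarkDuplicates and samtools rmdup work roughly along these lines. The Python sketch below only illustrates the idea. The `AlignedPair` record and its `score` field are hypothetical and are not how any of these tools represent reads.

```python
from typing import NamedTuple


class AlignedPair(NamedTuple):
    """Hypothetical summary of one aligned read pair."""
    name: str
    chrom: str
    start1: int   # mapped start of mate 1
    strand1: str  # "+" or "-"
    start2: int   # mapped start of mate 2
    strand2: str
    score: int    # assumed quality summary used to pick which copy to keep


def remove_duplicate_pairs(pairs):
    """Keep the highest-scoring pair for each distinct mapping signature."""
    best = {}
    for p in pairs:
        key = (p.chrom, p.start1, p.strand1, p.start2, p.strand2)
        if key not in best or p.score > best[key].score:
            best[key] = p
    return list(best.values())


pairs = [
    AlignedPair("a", "chr1", 100, "+", 300, "-", 50),
    AlignedPair("b", "chr1", 100, "+", 300, "-", 70),  # same signature as "a"
    AlignedPair("c", "chr1", 100, "+", 320, "-", 40),  # mate 2 maps elsewhere
]
unique = remove_duplicate_pairs(pairs)
```

This also shows why deduplication is easier after alignment: the mapped coordinates provide the key, so sequencing errors in the read sequence itself do not hide a duplicate.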
I am not sure how removing exact paired-end duplicates is going to change anything. Moreover, if you do exon capture, you may end up with fragments too short for paired-end sequencing, at least if you use an array with oligos representing your exons on it. Since you have no specific question in there, it is a bit hard to discuss specific details.
Don't forget:
7.5. What to do when we get too many variants, or none that make sense.