Hello, I am carefully performing an SNP analysis, primarily with BWA, SAMtools, Picard, BCFtools and VarScan. The one thing I am not sure I have under control is the strand information in my data. My reads are 75 bp paired-end from a cDNA library that is not strand-specific. My question is: how does VarScan handle this? Will the input have info about this? I know that some A-to-G calls will be lost in the absence of strand information. Other than that, will the final VCF file contain this information, i.e. whether the SNP is on the plus or minus strand? If not, how would I best tackle this problem? Thanks, G.
Yes, this is true.
OK, thanks. I guess my confusion stems from the fact that the VarScan filter command has an option (--min-strands2) that can be set to 1 or 2. What's the recommended value here? Should the variant be observed on both strands, or is 1 enough?
Only you can answer this question. The "strand" in this case has to do with the strandedness of the READ, i.e. whether the read aligned to the forward or the reverse strand of the reference. If the variant is seen on only one strand, that may represent a false positive finding. I would recommend running VarScan liberally (let everything through) and then filtering after the fact.
Yep, Sean nailed it. SNVs that appear on only one strand are often false positives. That said, if sequence coverage is low at that site, you may see reads from only one strand just by chance. It's up to you to decide what the acceptable thresholds are.
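For what it's worth, here is a rough sketch of what "filter after the fact" could look like in Python. It assumes the calls were written as VCF and that each sample column carries ADF and ADR (variant-supporting reads on the forward and reverse strand); check your own header to confirm those tags are present. The function names and the low_depth cutoff of 8 are made up for illustration and are not VarScan defaults.

```python
def strand_counts(vcf_line, sample_index=0):
    """Return (ADF, ADR) for one sample of a VCF data line, or None if absent.

    Assumes the FORMAT column lists ADF and ADR (variant reads on the
    forward / reverse strand); this is an assumption about the input file.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")                    # FORMAT column
    values = fields[9 + sample_index].split(":")   # chosen sample column
    sample = dict(zip(keys, values))
    try:
        return int(sample["ADF"]), int(sample["ADR"])
    except (KeyError, ValueError):
        return None


def passes_strand_filter(adf, adr, min_strands=2, low_depth=8):
    """Require variant support on min_strands strands, unless depth is low.

    low_depth is a hypothetical, illustrative cutoff: with very few
    variant reads, seeing only one strand can happen by chance, so the
    two-strand requirement is relaxed to one strand below it.
    """
    strands_seen = (adf > 0) + (adr > 0)
    if adf + adr < low_depth:
        return strands_seen >= 1
    return strands_seen >= min_strands


if __name__ == "__main__":
    # Made-up record, only to show the expected column layout.
    line = "chr1\t1000\t.\tA\tG\t.\tPASS\tADP=30\tGT:ADF:ADR\t0/1:12:0"
    counts = strand_counts(line)
    if counts is not None:
        print(counts, passes_strand_filter(*counts))
```

Relaxing the rule at low depth is just one way to encode the point about coverage; you could equally keep those sites and flag them for manual review.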
Thanks, all! Will keep this in mind.