I have scRNA-seq data and I would like to do variant calling for one gene (KRAS) so I can annotate clusters in the downstream analysis. I have looked at past posts and there doesn't seem to be much consensus on which tools to use. However, I understand that I would do something like: 1) align reads with STAR to generate a BAM file and subsequently generate a pileup file; 2) run the FreeBayes variant caller to find SNVs.
Most of the tools and workflows for variant calling focus on finding SNPs across the entire genome, and I would only like to look at one specific gene, KRAS.
Another question: what read depth is appropriate for variant calling on a single gene (as opposed to the entire transcriptome)? Would there be a difference?
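For the single-gene restriction asked about here, one option is to limit the caller to a genomic interval rather than look for a KRAS-specific tool. The sketch below only assembles the command lines; it assumes `samtools` and `freebayes` are installed, that the BAM is coordinate-sorted and indexed, and that the contig is named `chr12`. The interval is a padded placeholder around KRAS on GRCh38 and should be checked against your own annotation; the file names are hypothetical. The two tools may treat the interval ends slightly differently, which is one more reason to pad the window.

```python
import shlex


def region_string(chrom, start, end):
    """Format an interval as chrom:start-end for samtools / FreeBayes."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    return f"{chrom}:{start}-{end}"


def build_commands(bam, reference_fasta, region, out_prefix):
    """Return argument lists that subset a BAM to one region and call variants there."""
    subset_bam = f"{out_prefix}.bam"
    return [
        # Keep only reads overlapping the region (input BAM must be indexed).
        ["samtools", "view", "-b", "-o", subset_bam, bam, region],
        ["samtools", "index", subset_bam],
        # -f: reference FASTA, -r: restrict calling to the region; VCF goes to stdout.
        ["freebayes", "-f", reference_fasta, "-r", region, subset_bam],
    ]


# ASSUMPTION: approximate, padded window around KRAS on GRCh38; verify in your GTF.
KRAS_REGION = region_string("chr12", 25_200_000, 25_260_000)

# Hypothetical file names, for illustration only.
commands = build_commands("aligned.bam", "genome.fa", KRAS_REGION, "kras_subset")

for cmd in commands:
    print(shlex.join(cmd))
```

The printed lines can be pasted into a shell, or each list can be passed to `subprocess.run(cmd, check=True)`.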
Which single-cell sequencing method did you use? Most single-cell libraries are biased toward one end of the transcript or the other, so you might only have a fraction of the transcript covered with reads.
We use 10X. I was thinking of using FreeBayes on the aligned reads, although I know that the called SNPs depend strongly on the read depth.
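To get a feel for how depth limits detection, here is a rough back-of-the-envelope sketch. It assumes a heterozygous mutation with both alleles expressed equally (alternate fraction 0.5), independent sampling of molecules, and no sequencing error; none of these assumptions come from the data in this thread. For 10X data, "depth" is better counted in distinct UMIs than in raw reads.

```python
def prob_at_least_one_alt(depth, alt_fraction=0.5):
    """Probability of sampling >= 1 alternate-allele molecule at a given depth.

    Simple binomial model: each molecule carries the alternate allele
    independently with probability alt_fraction (an assumption).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if not 0.0 <= alt_fraction <= 1.0:
        raise ValueError("alt_fraction must be between 0 and 1")
    return 1.0 - (1.0 - alt_fraction) ** depth


for depth in (1, 2, 5, 10):
    print(depth, round(prob_at_least_one_alt(depth), 3))
```

Under this model a cell with a single molecule over the site shows the mutant allele only half the time, so a cell with reference-only reads at low depth cannot be confidently called wild type.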
As the comment above already indicates, the ability to detect mutations will depend strongly on whether you actually managed to sequence the part of the KRAS gene that is typically mutated. Generally, there's not a lot wrong with your pipeline; I wouldn't stress about the variant caller before actually having looked at the BAM file. Even if your mutation happens to be in a region for which you managed to capture reads, the depth per single cell will most likely be on the low end (I would guess below ten reads), so making a mutation call will most likely come down to manual annotation (if you are looking for known mutations).
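The manual annotation described above can be sketched as a per-cell tally at one known position. This is a minimal illustration, not a validated caller. It assumes you have already extracted one `(cell_barcode, umi, base)` tuple per read covering the site (in Cell Ranger BAMs the barcode and UMI are stored in the `CB` and `UB` tags); the extraction step, the labels, and the `min_alt` threshold are hypothetical choices.

```python
from collections import defaultdict


def count_alleles_per_cell(observations, ref_base):
    """Tally reference vs. alternate molecules per cell at a single position.

    observations: iterable of (cell_barcode, umi, base) tuples.
    Each (barcode, umi) pair is counted once, using the first base seen;
    conflicting bases within one UMI are ignored here for simplicity.
    """
    seen = set()
    counts = defaultdict(lambda: {"ref": 0, "alt": 0})
    for barcode, umi, base in observations:
        if base not in "ACGT":      # skip N and other ambiguous calls
            continue
        if (barcode, umi) in seen:  # PCR duplicate of a molecule already counted
            continue
        seen.add((barcode, umi))
        counts[barcode]["ref" if base == ref_base else "alt"] += 1
    return dict(counts)


def label_cell(cell_counts, min_alt=1):
    """Label one cell; 'ref_only' is NOT proof of wild type at low depth."""
    if cell_counts is None:
        return "no_coverage"
    return "mutant" if cell_counts["alt"] >= min_alt else "ref_only"


# Toy data, invented for illustration.
toy = [
    ("CELL_A", "umi1", "C"),
    ("CELL_A", "umi1", "C"),  # duplicate read of the same molecule
    ("CELL_A", "umi2", "T"),
    ("CELL_B", "umi3", "C"),
    ("CELL_B", "umi4", "N"),  # ambiguous base, skipped
]
per_cell = count_alleles_per_cell(toy, ref_base="C")
for barcode in ("CELL_A", "CELL_B", "CELL_C"):
    print(barcode, per_cell.get(barcode), label_cell(per_cell.get(barcode)))
```

The resulting labels can be joined to the cell metadata by barcode so that clusters are annotated with mutant, reference-only, and uncovered cells kept as separate categories.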
If this is cancer, most KRAS mutations (~80-90% of KRAS-mutant tumors) occur at just one of two amino acid residues within the protein (G12X or G13X, where X is any amino acid). So it would be entirely possible to have a manual component.
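Because only two codons matter for most cases, naming the change by hand is straightforward. The sketch below translates an observed codon into G12X / G13X notation. It assumes the reference codons GGT (codon 12) and GGC (codon 13), which should be confirmed against your reference transcript, and it includes a reverse-complement option because KRAS lies on the minus strand, so bases read off a genome-aligned BAM are the complement of the coding sequence. The function and parameter names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# ASSUMPTION: reference codons for KRAS G12 and G13; verify in your transcript.
KRAS_REF_CODONS = {12: "GGT", 13: "GGC"}


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def name_change(codon_number, observed_codon, from_genome_strand=False):
    """Return e.g. 'G12D', or None if the amino acid is unchanged.

    Set from_genome_strand=True when the three bases were read off the
    forward genomic strand (KRAS is transcribed from the minus strand).
    """
    codon = observed_codon.upper()
    if from_genome_strand:
        codon = reverse_complement(codon)
    ref_aa = CODON_TABLE[KRAS_REF_CODONS[codon_number]]
    obs_aa = CODON_TABLE[codon]
    if obs_aa == ref_aa:
        return None
    return f"{ref_aa}{codon_number}{obs_aa}"


print(name_change(12, "GAT"))
print(name_change(13, "GAC"))
print(name_change(12, "ATC", from_genome_strand=True))
```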
But that means 3' biased sequencing will never cover those sites.
Yes, you are right. But there are also oncogenes that may have many mutations in the middle of a long protein and might not get any coverage with either strategy. This also illustrates why study design is important, as 5'-biased sequencing would likely cover these sites.