I am trying to find an approach for detecting mutation biases in my data, if they exist, but I can't think of a good method. I see particular base-change biases in my genome sequencing data that match similar published data; for example, C mutates to T more often than C mutates to A. It is relatively straightforward to get this information from the output VCF file.
However, I would now like to look for more complex patterns. For instance, perhaps C often changes to T, but most of the time only in the context ACA -> ATA, because the surrounding bases influence the error rate. Similarly, any number of surrounding bases might influence the error rate, so that perhaps AAACAAA -> AAATAAA is the most prevalent C -> T change.
So I am looking for guidance or suggestions on how to proceed with an analysis like this. I know some labs have performed and published this type of analysis, but I can't think of how to do it myself.
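One possible way to start, as a rough sketch rather than an established method: for each single-base substitution, look up the flanking reference bases and tally the (reference context, mutated context) pairs. The function name `context_counts`, the in-memory `reference` dict, and the `(chrom, pos, ref, alt)` tuples are all assumptions made for illustration; in practice the variants would be parsed from the VCF and the sequence read from the reference FASTA.

```python
from collections import Counter

def context_counts(reference, variants, flank=1):
    """Tally substitutions together with `flank` reference bases on each side.

    reference: dict of chromosome name -> sequence string (assumed already loaded)
    variants:  iterable of (chrom, pos, ref, alt), pos 1-based as in a VCF
    """
    counts = Counter()
    for chrom, pos, ref, alt in variants:
        # Only single-base substitutions are considered in this sketch.
        if len(ref) != 1 or len(alt) != 1:
            continue
        seq = reference[chrom]
        i = pos - 1  # convert to 0-based index
        # Skip sites too close to a sequence end to have full flanks.
        if i - flank < 0 or i + flank >= len(seq):
            continue
        # Skip records whose REF allele disagrees with the reference sequence.
        if seq[i].upper() != ref.upper():
            continue
        left = seq[i - flank:i].upper()
        right = seq[i + 1:i + 1 + flank].upper()
        before = left + ref.upper() + right
        after = left + alt.upper() + right
        counts[(before, after)] += 1
    return counts

# Toy data, invented purely to show the shape of the output.
reference = {"chr1": "GGACATTACAGG"}
variants = [
    ("chr1", 4, "C", "T"),
    ("chr1", 9, "C", "T"),
    ("chr1", 6, "T", "G"),
]
counts = context_counts(reference, variants, flank=1)
for (before, after), n in counts.most_common():
    print(f"{before} -> {after}: {n}")
```

Raising `flank` to 3 would give seven-base contexts such as AAACAAA -> AAATAAA. To call something a bias, the raw counts probably need to be compared against how often each context occurs in the reference, since a common context will collect more mutations by chance alone.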
Oh cool, I wasn't aware of bedtools getfasta. That is helpful.