"I have problem with interpreting CNVs" - my friend, step into my office.
Yeah, it's just a really difficult thing to study using short-read sequencing, particularly in regions of high sequence homology.
Here's the first thing. It is unclear to me whether you are talking about indels or large copy number variants. You say "variants that introduce frameshift mutations", which more strongly suggests indels to me, but the term you use is copy number variant. Although "CNV" is used variably, I think people more often have in mind a much longer sequence than what an indel is defined to be....
The reason this matters is that no single piece of software is recognized or agreed upon as the most accurate for all classes of genetic variants. When the scientists who write these algorithms do what they do, they are frequently optimizing for accuracy in a certain case or cases. Therefore, it's always best (in my opinion) to at least skim the manuscript associated with a given algorithm and see what the intent of the software is. If you need to optimize for indels, there are options; if you need to optimize for SVs or CNVs, there are options, etc., but it's best to pick the right tool for the job.
What I'd recommend is that you check out a few review articles to get you started. As you skim these, I would encourage you to (1) quickly list the available options and (2) get a sense for HOW they do what they do (side note: the reason for (2) is that there are different ways to pick up CNVs and indels; for instance, some software is written specifically to look for reads that contain a breakpoint ...
This search https://pubmed.ncbi.nlm.nih.gov/?term=Copy+Number+Variation+AND+Software&filter=pubt.review&filter=datesearch.y_5&size=200 could help. Take a look at article 8 in the results, for example: https://pubmed.ncbi.nlm.nih.gov/30965134/. It is a review article, published within the last 5 years, that compares such tools. Could be a good place to start.
So that's the other thing. I could not provide a simpler answer to your question because I was ultimately unsure whether you were talking about CNVs or indels ...
It sounds like indels, right? Here's the thing. The answer to your question (is it a frameshift if the change in coding sequence length is not divisible by three?) is yes, but there are some caveats.
1) Sometimes the software will spit out what is really one variant as two, or what are really two variants as one. In other cases, it will correctly report the number of variants, but the net number of bases gained or lost could add up to a number N such that N % 3 == 0. For example, what if you had a deletion of 7 base pairs and then, two bases later, an insertion of 4 base pairs? The NET change would be a loss of 3 bases, and 3 % 3 == 0, so the reading frame is restored downstream even though each variant on its own looks like a frameshift.
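To make the arithmetic in that caveat concrete, here is a minimal R sketch. The function name `net_frame_effect` and the signed-length convention (negative for deletions, positive for insertions) are assumptions for illustration, not the output format of any particular variant caller. Note that R writes the modulus operator as `%%`.

```r
# Hypothetical helper: given the signed length changes of nearby indels
# (negative = deletion, positive = insertion), report the net change and
# whether the combined effect preserves the reading frame.
net_frame_effect <- function(length_changes) {
  net <- sum(length_changes)
  list(
    net_bp = net,
    in_frame = net %% 3 == 0  # TRUE when the net change is a multiple of 3
  )
}

# The example from the text: a 7 bp deletion followed by a 4 bp insertion.
combined <- net_frame_effect(c(-7, 4))
print(combined$net_bp)    # [1] -3
print(combined$in_frame)  # [1] TRUE

# Each variant considered on its own would be called a frameshift.
print(net_frame_effect(-7)$in_frame)  # [1] FALSE
print(net_frame_effect(4)$in_frame)   # [1] FALSE
```

This only checks length arithmetic; the bases between the two indels are still read out of frame, so the protein can change even when the net effect is in frame.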
Now, finally, to answer the question:
I want to be clear that I still recommend you use a variant caller designed for indels.
Having said that, you can get a pretty rapid answer to your question via biomaRt.
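library(biomaRt)
# connect to the Ensembl human gene dataset
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# list the attributes that can be retrieved from this dataset
listAttributes(human)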
returns:
Description
1 Gene stable ID
2 Transcript stable ID
3 Transcript stable
4 ID
.....
Now, if you scroll down to rows 237 and 238 (the exact row numbers may differ depending on the Ensembl release), you will see cDNA coding start and cDNA coding end.
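Since row numbers in the attribute table can shift between releases, searching the table is more robust than scrolling through it. Below is a minimal sketch, assuming the table has `name` and `description` columns (which is what `listAttributes()` returns). `find_attributes` is a hypothetical helper, and the small `attrs` table is made-up stand-in data so that the example runs offline.

```r
# Hypothetical helper: filter an attribute table by a case-insensitive pattern
# matched against either the attribute name or its description.
find_attributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, , drop = FALSE]
}

# Stand-in table for illustration only; with a live connection you would use
#   attrs <- listAttributes(human)
attrs <- data.frame(
  name = c("ensembl_gene_id", "cdna_coding_start", "cdna_coding_end"),
  description = c("Gene stable ID", "cDNA coding start", "cDNA coding end")
)

# Keeps only the two cDNA coding rows of the stand-in table.
print(find_attributes(attrs, "cdna coding"))
```

biomaRt also ships a `searchAttributes()` function that should do much the same against a live mart; check your installed version's documentation for its exact arguments.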
After confirming FOR SURE that you're on the same build, same transcript, etc., you can use these coordinates to tell whether two indels fall in the same exon or not. If they do, you can sum the net change and see whether it is divisible by 3.
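The "same exon, then sum" step can be sketched as follows. Everything here is illustrative: the exon and indel tables are made up, `net_change_by_exon` is a hypothetical helper, and the `getBM()` call in the comment uses attribute names that I believe exist but that you should confirm against `listAttributes(human)`.

```r
# Made-up exon table for illustration. With a live connection, something like
# the following might produce a comparable table (attribute names should be
# checked against listAttributes(human); the transcript ID is a placeholder):
#   getBM(attributes = c("ensembl_exon_id", "exon_chrom_start", "exon_chrom_end"),
#         filters = "ensembl_transcript_id", values = "<your transcript ID>",
#         mart = human)
exons <- data.frame(
  exon_id = c("exonA", "exonB"),
  start = c(100, 500),
  end = c(200, 600)
)

# Made-up indel calls: genomic position and signed length change
# (negative = deletion, positive = insertion).
indels <- data.frame(
  pos = c(120, 129, 550, 900),
  net_bp = c(-7, 4, -2, 1)
)

# Sum the net length change of the indels that start inside each exon.
net_change_by_exon <- function(indels, exons) {
  # index of the first exon containing each indel position (NA if none)
  exon_idx <- vapply(indels$pos, function(p) {
    hit <- which(exons$start <= p & p <= exons$end)
    if (length(hit) == 0) NA_integer_ else hit[1]
  }, integer(1))
  inside <- !is.na(exon_idx)
  net <- tapply(indels$net_bp[inside], exons$exon_id[exon_idx[inside]], sum)
  data.frame(
    exon_id = names(net),
    net_bp = as.vector(net),
    in_frame = as.vector(net) %% 3 == 0,
    row.names = NULL
  )
}

print(net_change_by_exon(indels, exons))
```

With this toy data, the two indels in exonA net out to a 3 bp loss (in frame), the single indel in exonB is a frameshift, and the indel at position 900 is ignored because it lies outside both exons. The sketch only looks at where each indel starts, so a deletion that runs across an exon boundary would need separate handling.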
https://www.ensembl.info/2020/03/27/cool-stuff-the-ensembl-vep-can-do-annotating-structural-variants/