Number of Tandem and Interspersed duplications
5.3 years ago

Dear all,

I cannot find this information anywhere. I've checked phase 3 of the 1000 Genomes Project: they label a duplication as tandem only if it was called by DELLY, which is not a good criterion for comparison. I've also checked Biostars, and there is no real answer here even though some questions were somewhat similar.

There are several types of duplications: tandem and interspersed. People usually refer to all interspersed duplications as "segmental duplications", but to me that term implies high allele frequency.

So the question is: how many interspersed duplications exist in a typical individual human genome if we filter out variants with allele frequency >1%? I don't need the actual number; the percentage relative to tandem duplications would be fine.

(Motivation for this question: most SV calling tools detect tandem duplications but not interspersed ones, and variants with MAF >1% are usually considered unimportant. How much will I miss per human genome if I simply don't call interspersed duplications?)

CNV duplication SV • 2.0k views

To be honest, I've never heard interspersed duplications called 'segmental duplications'. As the name says, segmental duplications are duplications of complete segments of the genome (not just a single gene or a few genes). Over time these might look like interspersed duplications, because many of the duplicated genes in that segment will be 'removed' and only a few recognizable ones are kept.

Tandem and interspersed duplicated genes are the result of small-scale duplications (a continuous process in any genome); segmental duplications are large-scale (they only happen, or at least are 'fixed', every so often in the evolution of a genome).


Yep, sometimes they do =) https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz237/5425335 The problem in SV calling is that there is no strict terminology. These authors use a different idea behind "interspersed segmental duplications": they use the word "segment" to denote the segment, without any assumption about the number of genes inside it...

UPD: I was wrong, they used the Chaisson definition of SD https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4745987/ , so the duplications are actually assumed to be large, whatever that means.

5.3 years ago
d-cameron ★ 2.9k

At a high level, there are two classes of rearrangement detection: those relying on copy number, and those relying on breakpoint detection.

> Usually people refer to all interspersed duplications as "segmental duplications"

These are typically called by copy number callers. Such segmental duplication calls make no claim regarding the location in the genome: they could be simple tandem duplications, or they could be interspersed.

> motivation for this question: most of the SV calling tools detect tandem duplications, but not interspersed ones

In general, when you talk about SV detection, you're really talking about breakpoint detection. Many (most?) SV calling tools will indeed detect interspersed duplications, they just won't report them as such. Unlike a simple tandem duplication, an interspersed duplication will have two breakpoints: one from the start of the donor site to the insertion position, and the other from the end of the donor site to the insertion position (but in the opposite orientation). Any caller that can report inter-chromosomal breakpoints will be capable of reporting the breakpoints involved in an interspersed duplication. I'm not aware of any SV callers that will classify such events as interspersed duplications. That's typically done by downstream analysis tools that combine the SV calls with CN calls to do rearrangement classification/interpretation. Unfortunately, there are not many tools that do this well. The best one I know of is LINX^. It's a somatic-only tool, but the logic it uses for LINE insertion detection uses the same principles one would use for 'interspersed duplications' in general.

TLDR: SV and CNV callers detect interspersed duplications, they just don't call them that.

^ disclaimer: I'm involved in the development of this tool.
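To make the two-breakpoint signature concrete, here is a minimal sketch of how a downstream step could pair junctions into a candidate interspersed duplication. This is not LINX's logic or any caller's actual output format: the `Junction` record, the `slop` tolerance and the coordinates are all hypothetical, and only forward-to-forward junctions (no inversions) are handled.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Junction:
    """Hypothetical forward-to-forward junction: the sample sequence runs up to
    (from_chrom, from_pos), then continues rightward from (to_chrom, to_pos)."""
    from_chrom: str
    from_pos: int
    to_chrom: str
    to_pos: int


def classify_single(j):
    """Label one junction in isolation, without looking at any partner."""
    if j.from_chrom != j.to_chrom:
        return "translocation-like"
    return "deletion-like" if j.from_pos < j.to_pos else "tandem-duplication-like"


def find_interspersed_duplications(junctions, slop=10):
    """Pair junctions that together look like a copy of donor [a, b] inserted at c.

    Expected pattern: one junction c -> a (into the donor start) and one
    junction b -> c+1 (out of the donor end, back to the insertion site).
    `slop` is an assumed tolerance for imprecise breakpoint positions.
    """
    events = []
    for j_in, j_out in permutations(junctions, 2):
        same_site = (j_in.from_chrom == j_out.to_chrom
                     and abs(j_out.to_pos - j_in.from_pos) <= slop)
        same_donor = (j_in.to_chrom == j_out.from_chrom
                      and j_in.to_pos < j_out.from_pos)
        if not (same_site and same_donor):
            continue
        a, b, c = j_in.to_pos, j_out.from_pos, j_in.from_pos
        # An insertion site inside the donor segment is not a clean interspersed copy.
        inside_donor = j_in.from_chrom == j_in.to_chrom and a <= c <= b
        if not inside_donor:
            events.append((j_in.to_chrom, a, b, j_in.from_chrom, c))
    return events


# Made-up junctions for illustration only.
junctions = [
    Junction("chr1", 5000, "chr7", 1000),  # insertion site -> donor start
    Junction("chr7", 2000, "chr1", 5001),  # donor end -> back to insertion site
    Junction("chr2", 300, "chr2", 100),    # backward jump: simple tandem duplication
]
events = find_interspersed_duplications(junctions)
labels = [classify_single(j) for j in junctions]
print(events)
print(labels)
```

Taken one at a time, the first two junctions look like an unrelated pair of inter-chromosomal breakpoints; only when paired do they describe chr7:1000-2000 copied into chr1 at position 5000. When donor and insertion site are on the same chromosome, the same pairing shows up as one deletion-like and one duplication-like junction sharing an endpoint.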


> In general, when you talk about SV detection, you're really talking about breakpoint detection.

Not really. I am actually interested in gene dosage: I do not care where the EGFR gene was inserted, I care whether it was copied more than 2 times.

Yeah, the answer is good, but it is kind of irrelevant to the question =( I use e.g. DELLY, which detects tandem duplications only. Should I polish my callset with a read-depth method to find interspersed duplications? Or is their proportion so low that I can just rely on DELLY's results? Read-depth methods often produce false positives, so I'd like to avoid this step.


If you don't care where they are inserted, then you're actually asking a different question. What you want to ask is how many copy number variants (do you only care about gains, or losses as well?) exist in a human genome. This information is readily available in databases such as DGV (http://dgv.tcag.ca).

TLDR: an interspersed duplication is reported as a copy number gain by a CNV caller. The only difference is the terminology used to refer to the event.

Edit: also, you don't want an SV caller, you want to run a CNV caller. The exception is if you want to detect small (<10kb) copy number changes, in which case you'll have to run both.
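As a rough illustration of why a copy number gain is independent of where the extra copy sits, here is a minimal read-depth sketch. It is not how any particular CNV caller works: the function name, the binning and the depth values are assumptions, and real callers additionally correct for GC content, mappability and noise.

```python
from statistics import median


def estimate_copy_number(region_depths, genome_depths, ploidy=2):
    """Estimate integer copy number of a region from per-bin read depth.

    region_depths: mean depth of each bin overlapping the region of interest
    genome_depths: mean depth of bins across the (assumed mostly diploid) genome
    """
    baseline = median(genome_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")
    ratio = median(region_depths) / baseline
    return round(ploidy * ratio)


# Made-up per-bin depths: the last three genome bins cover the gene of interest.
genome_bins = [30, 31, 29, 30, 32, 28, 30, 45, 46, 44]
gene_bins = [45, 46, 44]
cn = estimate_copy_number(gene_bins, genome_bins)
print(cn)
```

Reads from the extra copy map back to the donor locus whether the copy was inserted in tandem or elsewhere in the genome, so the depth at the gene rises either way. That is why the dosage question can be answered without resolving the insertion site.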


You are absolutely right. I am looking mainly for germline copy-number variations, and since the application is clinical, <10kb is not just a desired resolution but a must-have. The problem is the method. DELLY calls CNVs using paired-end mapping (PEM) + read depth (RD), and the FDR of DELLY is normally quite reasonable. Pure read-depth methods are usually not that good for <10kb variants. I have to use read-depth methods for WES data, and people complain about false discoveries. But now we have hundreds of WGS samples, and the question was: what if I stop using read-depth methods? How many duplications will I lose? DELLY detects only tandem ones (at least the author claims so). To answer this question I could just run DELLY + something like CNVnator and compare the numbers, but there are obvious drawbacks, which is why I asked the community: what are the validated numbers? Can read-depth methods be avoided? (After I started benchmarking, I saw hundreds of duplications detected by RD and not by PEM, so maybe the answer is just "you still have to use both", but I still struggle to find enough evidence.)
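For the "run both and compare the numbers" idea, a minimal sketch of the comparison step could look like this. It assumes both callsets have already been reduced to duplication intervals as `(chrom, start, end)` tuples and uses 50% reciprocal overlap as the matching rule. That threshold is a common convention chosen here as an assumption, and the coordinates are made up rather than taken from DELLY or CNVnator output.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions for (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def unmatched(calls, other_calls, min_ro=0.5):
    """Calls with no partner in other_calls at >= min_ro reciprocal overlap."""
    return [c for c in calls
            if not any(reciprocal_overlap(c, o) >= min_ro for o in other_calls)]


# Hypothetical duplication calls from a PEM-based and an RD-based caller.
pem_dups = [("chr1", 1000, 5000), ("chr2", 20000, 26000)]
rd_dups = [("chr1", 1200, 5100), ("chr3", 70000, 90000), ("chr2", 10000, 40000)]

rd_only = unmatched(rd_dups, pem_dups)
pem_only = unmatched(pem_dups, rd_dups)
print(len(rd_only), "RD-only;", len(pem_only), "PEM-only")
```

The RD-only count is only an upper bound on what would be lost by dropping read depth, since without validation it mixes true interspersed or hard-to-map duplications with read-depth false positives.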


Part of the problem is that many of the germline events are in regions that are difficult to call (microsatellite expansions are a good example of this). Have you considered using tools other than DELLY?

A pair of benchmarking papers came out recently:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1720-5

https://www.nature.com/articles/s41467-019-11146-4

Both show that DELLY is outperformed (especially for duplications) by more recent tools.

> can read-depth method be avoided?

Unfortunately, the paper that included a comprehensive breakdown of sensitivity by event size and type only included SV callers and not CNV callers, so we don't have an answer to that question with rigorous data to back it up.


Thanks a lot! I thought the same, but still asked =) DELLY is a magic tool. It may be outperformed by others, but when the analysis is performed by the author (Rausch), it outperforms them right back =) The random forest post-filtering is really good. There was an open SV calling competition where EMBL submitted 2 results, and wow, DELLY does a good job with the post-filtering. Thanks a lot for the papers, I will check them out!
