Question

Confirming Structural Variants In The Genome

12.8 years ago

Rubal7 ▴ 850

Hi Everyone,

I'd like to get some advice/ideas on how to confirm that there is structural variation in a specific genomic region in a population. We sequenced individuals from two populations to high coverage, ~50X worth of data for each population. We detect a region of the genome ~500kb in size where we see double the amount of coverage in one population compared to the other. We therefore suspect the region is duplicated in one population (we also see a large increase in heterozygosity in the population with the higher coverage, which is presumably from mapping two genomic regions to one location).

We'd like to confirm that there is indeed a duplication here and not just a random increase in coverage in one population at this site. One approach we are considering is searching for 'junction fragments', the reads that contain part of both the original sequence and the new duplicated sequence. Presumably these will not have been mapped, as they have no close match in the reference genome. If anybody knows of a good way to do this, or knows of any papers or software that deal with this problem, that would be great.

Any other ideas for confirming the presence of copy number variation are appreciated, ideally methods that we could use on the existing data rather than resequencing.

*I should specify that we would also like to know exactly where the structural variant begins and ends.

Many thanks in advance
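Editor's sketch: the coverage comparison described in this question can be turned into a rough first estimate of where the high-coverage block starts and ends. The example below is a minimal illustration only, assuming per-window mean depths have already been computed for both populations. The function name, the window size and the 1.5 ratio threshold are hypothetical choices, not something from the thread, and the resolution is limited to the window size.

```python
from statistics import median


def duplicated_windows(depth_a, depth_b, window_size, min_ratio=1.5):
    """Return (start, end) of the longest run of windows whose normalised
    depth ratio A/B is at least min_ratio, or None if there is no such run.

    Coordinates are 0-based and end-exclusive, relative to the first window.
    """
    # Normalise each population by its own median window depth so that a
    # difference in total sequencing yield does not look like a duplication.
    # (Assumes both medians are non-zero.)
    med_a, med_b = median(depth_a), median(depth_b)
    ratios = [
        (a / med_a) / (b / med_b) if b > 0 else 0.0
        for a, b in zip(depth_a, depth_b)
    ]

    best = None
    run_start = None
    # The trailing 0.0 is a sentinel that closes a run reaching the last window.
    for i, ratio in enumerate(ratios + [0.0]):
        if ratio >= min_ratio:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if best is None or i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None

    if best is None:
        return None
    return best[0] * window_size, best[1] * window_size


# Made-up depths in 100 kb windows: population A is roughly doubled in windows 3-5.
pop_a = [50, 52, 48, 100, 98, 102, 51, 49]
pop_b = [50, 50, 50, 50, 50, 50, 50, 50]
print(duplicated_windows(pop_a, pop_b, window_size=100_000))
```

Smaller windows give tighter boundaries at the cost of noisier ratios. Nucleotide-level breakpoints still need read-level evidence.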

genome coverage • 3.6k views

ADD COMMENT • link updated 6.6 years ago by Biostar 20 • written 12.8 years ago by Rubal7 ▴ 850

What species? Are the "individuals" expected to be homogeneous, genetically?

ADD REPLY • link 12.8 years ago by Sean Davis 27k

score 3 · Answer 1 · 2012-06-28

I would use a second, non-informatics technique to confirm this. You should be able to design 4-5 primers at various spots within and flanking your 500 kb duplication, and perform quantitative PCR. You don't say what n you are working with in your populations, but qPCR is a relatively efficient way of detecting copy number in large populations (96-well plates at a time). The biggest problem is running enough controls; we usually use 3 housekeeping genes from non-duplicated regions. Array CGH would be painful and expensive if you have large numbers in your populations, and you really only want to know about this one region anyway. If there is a BAC within your area of duplication (look in UCSC) you could order probes for that and prove the duplication with FISH -- but that is more work than qPCR.
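Editor's sketch: one common way to turn qPCR readings into a copy number estimate is the 2^-ΔΔCt calculation. The example below is only an illustration of that arithmetic, assuming roughly 100% amplification efficiency, a calibrator sample known to carry two copies of the target, and reference Ct values averaged over the housekeeping genes. The function name and all Ct values are made up.

```python
from statistics import mean


def copy_number(target_ct, reference_cts, cal_target_ct, cal_reference_cts,
                calibrator_copies=2):
    """Estimate target copy number with the 2^-ddCt calculation.

    target_ct / reference_cts: Ct values for the test sample.
    cal_target_ct / cal_reference_cts: Ct values for the calibrator sample,
    which is assumed to carry `calibrator_copies` copies of the target.
    """
    # dCt: target relative to the mean of the reference (housekeeping) genes.
    delta_sample = target_ct - mean(reference_cts)
    delta_calibrator = cal_target_ct - mean(cal_reference_cts)
    # ddCt: test sample relative to the calibrator.
    ddct = delta_sample - delta_calibrator
    # Each cycle earlier corresponds to a doubling of starting template.
    return calibrator_copies * 2 ** (-ddct)


# Target amplifies one cycle earlier than in the calibrator: about 4 copies.
print(copy_number(24.0, [25.0, 25.0, 25.0], 25.0, [25.0, 25.0, 25.0]))
```

A single extra copy (3 versus 2) shifts Ct by only about 0.58 cycles, which is why replicates and several reference genes matter.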

score 2 · Answer 2 · 2012-06-28

12.8 years ago

Vikas Bansal ★ 2.4k

If you really want to confirm it, then I would suggest using wet lab methods, for example PCR, MLPA, CGH or FISH.

ADD COMMENT • link 12.8 years ago by Vikas Bansal ★ 2.4k

score 1 · Answer 3 · 2012-07-06

I think you can use your existing sequence data to help narrow down the breakpoints. There is a whole range of software and approaches for detecting structural variation; try http://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.1374.html for a start. You can use the paired-end reads to help: look for pairs where one of the reads is mapped in the 500kb region and the other is mapped either to another chromosome or, more likely if it's a tandem duplication, to another location on the same chromosome, with the reads mapping further apart than you would expect from the library insert size. You can also use the single-end reads which have split mapping, as you suggest above; try http://www.biomedcentral.com/1471-2105/13/S6/S6 , and there is probably other software as well.

That should get the breakpoint region down to a range where you can PCR and sequence it with a single set of primers, at least in one patient, and the bioinformatic methods should get you close enough to know whether you're going to have closely clustered breakpoints in all your population or a wide range.
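Editor's sketch: the read-pair screen described in this answer can be illustrated as a small classifier. This is a minimal example under stated assumptions, not a replacement for a dedicated structural variant caller: each mate is represented as a hypothetical `(chrom, pos, is_reverse)` tuple, the distance is measured between leftmost positions, and the library is assumed to be a standard inward-facing paired-end library, in which pairs spanning a tandem duplication junction are expected to map far apart and facing outward.

```python
def classify_pair(r1, r2, region, max_insert):
    """Classify one read pair relative to a region of interest.

    r1, r2: (chrom, pos, is_reverse) for each mate.
    region: (chrom, start, end), end-exclusive.
    max_insert: largest mate distance considered normal for the library.
    """
    chrom, start, end = region
    in_region = [c == chrom and start <= p < end for c, p, _ in (r1, r2)]
    if not any(in_region):
        return "outside"
    if r1[0] != r2[0]:
        return "interchromosomal"

    left, right = sorted((r1, r2), key=lambda r: r[1])
    if right[1] - left[1] > max_insert:
        # Left mate on the reverse strand and right mate on the forward
        # strand means the mates point away from each other.
        if left[2] and not right[2]:
            return "everted_far"
        return "too_far"
    # Orientation is not checked for pairs at a normal distance.
    return "normal_distance"


region = ("chr1", 1_000_000, 1_500_000)
pairs = [
    (("chr1", 1_499_900, False), ("chr1", 1_000_100, True)),
    (("chr1", 1_200_000, False), ("chr2", 500, True)),
    (("chr1", 1_200_000, False), ("chr1", 1_200_300, True)),
]
for r1, r2 in pairs:
    print(classify_pair(r1, r2, region, max_insert=1000))
```

Clusters of "everted_far" or "interchromosomal" pairs piling up at the same two locations are the ones worth following up. Isolated pairs are usually mapping noise.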

score 0 · Answer 4 · 2012-06-29

12.8 years ago

Rubal7 ▴ 850

Thanks Alex and Vikas, these are both useful ideas. I should have specified in my original question that I am also keen to know exactly where the copy number variation occurs, at nucleotide resolution, so that we can see if the variation is likely to be disrupting gene expression, for example by occurring in the middle of a gene. I can see how this could also be done with a large set of primers, but I believe in this case a simpler approach would be a bioinformatic one, using the extensive sequence data we have. But perhaps I am wrong.

ADD COMMENT • link 12.8 years ago by Rubal7 ▴ 850

You could edit your question or use comments. If you want to know the place of duplication, i.e. where the duplicated region gets inserted in the genome after duplication, then a bioinformatic approach alone is not a good solution. In your case you found the duplication in that 500kb region (assuming a read depth approach), which tells you which region got duplicated, not where the copy was inserted.

ADD REPLY • link 12.8 years ago by Vikas Bansal ★ 2.4k

True, I currently don't know where this region sits in the genome (presumably the most likely answer for a CNV is that it's a tandem duplication). But I was thinking that if we find the reads in which a portion of the bases aligns to the original region and the rest maps to an unexpected location, these would tell us where the CNV occurs. I believe these reads are known as 'junction fragments'. But without knowing exactly where the CNV ends, it leaves a very large search space of possible unmapped reads to check through.
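Editor's sketch: one way to shrink that search space is to look at partially aligned reads rather than fully unmapped ones. The example below assumes the reads were mapped with an aligner that reports soft clips in the SAM CIGAR string, which the thread does not state, and that each alignment is available as a hypothetical `(pos, cigar)` tuple with a 1-based leftmost position. It counts positions where several reads are clipped at the same reference base. The function names and thresholds are illustrative.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(pos, cigar, min_clip=20):
    """Return reference positions (1-based) where this read is soft-clipped."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    ops = [(n, op) for n, op in ops if op != "H"]  # ignore hard clips
    points = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Leading clip: the aligned part begins at the candidate breakpoint.
        points.append(pos)
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Trailing clip: the aligned part ends at the candidate breakpoint.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        points.append(pos + ref_len - 1)
    return points


def candidate_breakpoints(alignments, min_reads=3, min_clip=20):
    """Positions supported by at least min_reads clipped reads, sorted."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(clip_breakpoints(pos, cigar, min_clip))
    return sorted(p for p, n in counts.items() if n >= min_reads)


# Made-up 125 bp reads: three are clipped at the same reference base.
reads = [(1000, "85M40S"), (1010, "75M50S"), (1030, "55M70S"),
         (2000, "40S85M"), (500, "125M")]
print(candidate_breakpoints(reads))
```

If the duplication is in tandem, reads clipped at their right end would be expected to cluster at the end of the duplicated segment and reads clipped at their left end at its start. Re-aligning the clipped portions shows where the other side of each junction lies.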

ADD REPLY • link 12.8 years ago by Rubal7 ▴ 850

I don't think I would rely on the approach you suggest to tell you anything with certainty. If you want to verify the duplication exists -- use qPCR. If you want to visualize where the duplication is inserted you need to use FISH. I would not assume the duplication is in tandem at all without any evidence to that effect. The dup could be on a supernumerary marker chromosome for all you know. If you want to see if gene expression is affected by the duplication you need to do some mRNA work. Otherwise, you are left with predictions that are not biologically validated.

ADD REPLY • link 12.8 years ago by Alex Paciorkowski 3.5k

I agree with Alex. Just curious, how long are your reads?

ADD REPLY • link 12.8 years ago by Vikas Bansal ★ 2.4k

125bp single end. We have some paired end reads that could also be very useful in this regard.

ADD REPLY • link 12.8 years ago by Rubal7 ▴ 850