Question

Reasons Why A Region In The Human Genome May Get Bad Sequence/Alignment Using Illumina Platform?

6

12.9 years ago

Biomed 5.0k

We identified certain regions in the genome where coverage is significantly low across individuals. One of them is a high-GC region (70% GC), which might be one explanation, but I am wondering what other reasons could prevent good sequencing/alignment in this region. The question can be generalized as: "what characteristics of a genomic region make it hard to sequence and align, so that reliable genotype calling can't be performed with Illumina next-gen sequencing?" Other platforms have their own problems, but we are interested in issues specific to the Illumina platform.

Some issues we hypothesize to be important are:

Paralogous regions
Repeats
Segmental Duplications

Are there other things we should add to this checklist? Thank you.

illumina next-gen sequencing • 7.5k views

ADD COMMENT • link updated 12.9 years ago by Larry_Parnell 16k • written 12.9 years ago by Biomed 5.0k

1

Great question. Very interested in reading what people post on this topic.

ADD REPLY • link 12.9 years ago by 2184687-1231-83- ★ 5.1k

score 4 · Answer 1 · 2011-12-16

4

12.9 years ago

brentp 24k

You could view your region in the UCSC Genome Browser with a mappability track loaded. Mappability measures, roughly, how uniquely an N-bp read starting at a given position can be mapped when up to X mismatches are allowed.

You can also load the segmental duplication (segdups) and repeat tracks to check your ideas.
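
To make the mappability idea concrete, here is a minimal Python sketch. It is not how the UCSC tracks are computed: it only handles exact matches (X = 0), looks at the forward strand only, and works on a toy string rather than a genome. The function names `kmer_uniqueness` and `unique_fraction` are made up for illustration.

```python
from collections import Counter


def kmer_uniqueness(reference: str, k: int) -> list[bool]:
    """For each start position, report whether the k-mer starting there
    occurs exactly once in the reference (forward strand, exact match)."""
    reference = reference.upper()
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def unique_fraction(reference: str, k: int) -> float:
    """Fraction of start positions whose k-mer is unique in the reference."""
    flags = kmer_uniqueness(reference, k)
    return sum(flags) / len(flags) if flags else 0.0


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # "ACGT" occurs twice, so reads starting there are ambiguous
    print(kmer_uniqueness(toy, 4))
    print(unique_fraction(toy, 4))
```

Positions flagged `False` are the ones where a read of length k could be placed in more than one location, which is where an aligner would assign low mapping quality.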

ADD COMMENT • link 12.9 years ago by brentp 24k

score 4 · Answer 2 · 2011-12-16

4

12.9 years ago

Casey Bergman 18k

Wang et al. (and possibly others) show that for Illumina data "sequence coverage increases with GC-content increase when GC-content is less than 40–45%, but decreases when GC-content is more than 50–55%, with the peak at around 45%". Solid evidence for biases like these is hard to find in the literature (either because vendors may try to suppress this information or because nailing issues like these down is costly), so I am not aware of the underlying cause having been worked out in detail. However, one further mechanism to add to your list is that GC-rich regions do not amplify well by bridge PCR.
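
A quick way to check whether a low-coverage region falls in the problematic GC range is to compute GC content in windows along the sequence. The sketch below is only an illustration in plain Python: the function names are hypothetical, and the default 0.70 threshold is simply the GC value mentioned in the question, not a cutoff taken from Wang et al.

```python
def gc_fraction(seq: str) -> float:
    """GC fraction of a sequence, ignoring anything that is not A, C, G or T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """(start, GC fraction) for each full-length window along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


def high_gc_windows(seq: str, window: int, step: int,
                    threshold: float = 0.70) -> list[tuple[int, float]]:
    """Windows whose GC fraction is at or above the (illustrative) threshold."""
    return [(start, gc) for start, gc in windowed_gc(seq, window, step)
            if gc >= threshold]


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGC"
    print(windowed_gc(toy, window=8, step=8))
    print(high_gc_windows(toy, window=8, step=8))
```

Plotting per-window GC against per-window read depth for your own samples would show whether the coverage drop follows the GC trend quoted above.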

ADD COMMENT • link 12.9 years ago by Casey Bergman 18k

2

The low coverage in the region Biomed found is certainly caused by the high GC content. Illumina can barely sequence regions with GC > 80%, even at relatively high coverage. Nonetheless, I do not think Illumina is hiding this. Even in Bentley et al. (2008), they presented a plot (done by Aylwyn from Richard's group) showing that sequence coverage is significantly lower for high-GC regions. Yes, PCR is to blame.

ADD REPLY • link 12.9 years ago by lh3 33k

0

Thanks for the comments. Are there additional factors to consider for regions that are not necessarily high-GC but have low coverage for other reasons, whether from sequencing (PCR) or from bad alignment?

ADD REPLY • link 12.9 years ago by Biomed 5.0k

0

Bentley et al. (2008) do show how coverage decreases with extreme GC-content, but they downplay this result, saying it only affects "just 1% of unique chromosome sequence" -- what they don't say is that much of this 1% lies around CpG-island-containing promoters, and thus there are major biases for ChIP-seq using Illumina in mammals (as there are also for other platforms with GC-based coverage biases).

ADD REPLY • link 12.9 years ago by Casey Bergman 18k

score 2 · Answer 3 · 2011-12-16

2

12.9 years ago

Manu Prestat 4.1k

Just a guess: a hypervariable region (e.g., immunoglobulin)?

ADD COMMENT • link 12.9 years ago by Manu Prestat 4.1k

0

Thanks for the input.

ADD REPLY • link 12.9 years ago by Biomed 5.0k

score 1 · Answer 4 · 2011-12-16

1

12.9 years ago

Larry_Parnell 16k

Another reason, which at times can overlap with the possibilities offered by Brent and Casey, is the ability of the replication/sequencing machinery (i.e., enzymes) to replicate a given DNA strand as part of the sequencing reaction. Some DNA regions are poorly copied by PCR or reverse transcriptase (or are even toxic when transfected into bacteria). One explanation for this is the structure of the DNA, which could be due to GC-content, but not always. Other factors could come into play.

While the paralogous regions, repeats and segmental duplications on your list can be identified by computational approaches, not all structural variants can be identified so readily. As Casey wrote, the mechanisms are not known in enough detail for an algorithm to be written.

ADD COMMENT • link 12.9 years ago by Larry_Parnell 16k

2

Just to add that this "toxic DNA" issue mainly applies to Sanger sequencing, where we typically insert a ~3 kb sequence into a plasmid. Protein fragments expressed from this 3 kb sequence may kill the bacteria. I am not sure whether this is still an issue for high-throughput sequencing.

ADD REPLY • link 12.9 years ago by lh3 33k

1

I agree, but some of the "toxicity" was thought to arise from structural features of the insert or insert+vector combo. Thus I added it, but parenthetically.

ADD REPLY • link 12.9 years ago by Larry_Parnell 16k
