Reasons Why A Region In The Human Genome May Get Bad Sequence/Alignment Using Illumina Platform?
4
6
12.9 years ago
Biomed 5.0k

We identified certain regions in the genome where coverage is consistently low across individuals. One of those is a high GC content region (70%), so that might be one explanation, but I am wondering what other reasons there may be for poor sequencing/alignment in this region. The question can be generalized as "what characteristics of a genomic region make it so hard to sequence and align that reliable genotype calling can't be performed using Illumina next-gen sequencing?" Other platforms have their own problems, but we are interested in issues specific to the Illumina platform.

Some issues we hypothesize to be important are:

Paralogous regions
Repeats
Segmental Duplications

Do you have other things that we can add to this list of things to check? Thank you

illumina next-gen sequencing • 7.5k views
1

Great question. Very interested in reading what people post on this topic.

4
12.9 years ago
brentp 24k

You could view your region in the UCSC genome browser with a mappability track loaded. That defines something like the ability to uniquely map an N basepair read with up to X mismatches.
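For intuition, the idea behind such a track can be sketched in a few lines of Python. This is only a toy version and not how the UCSC track is built: it assumes exact matches (X = 0), looks at the forward strand only, and the function names `kmer_uniqueness` and `fraction_unique` are made up for illustration.

```python
from collections import Counter


def kmer_uniqueness(reference, k):
    """Return one flag per k-mer start position: 1 if that k-mer occurs
    exactly once in the reference, else 0.

    Assumptions (toy model): exact matching only, forward strand only.
    """
    ref = reference.upper()
    kmers = [ref[i:i + k] for i in range(len(ref) - k + 1)]
    counts = Counter(kmers)
    return [1 if counts[kmer] == 1 else 0 for kmer in kmers]


def fraction_unique(reference, k, start, end):
    """Fraction of k-mer start positions in [start, end) that are unique.

    Returns None if the interval contains no k-mer start positions.
    """
    flags = kmer_uniqueness(reference, k)[start:end]
    return sum(flags) / len(flags) if flags else None


if __name__ == "__main__":
    toy_reference = "ACGTACGTTT"  # made-up sequence with one repeated 4-mer
    print(kmer_uniqueness(toy_reference, 4))
    print(fraction_unique(toy_reference, 4, 0, 4))
```

A region where this fraction is low for your read length is one where reads cannot be placed unambiguously, whatever the sequencing quality.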

You can also load the segdups and repeats tracks to check your ideas.
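If you would rather check this outside the browser, one rough approach is to count how many bases of your region are covered by annotated intervals. The sketch below assumes you have exported the annotations yourself as BED-style (chrom, start, end) tuples with 0-based, half-open coordinates; `covered_bases` is a hypothetical helper, not part of any existing tool.

```python
def covered_bases(region, intervals):
    """Count bases of `region` covered by at least one interval.

    region: (chrom, start, end); intervals: iterable of (chrom, start, end).
    Coordinates are assumed to be 0-based and half-open, as in BED files.
    Overlapping intervals are only counted once.
    """
    chrom, start, end = region
    # Keep only intervals that overlap the region, clipped to its bounds.
    clipped = sorted(
        (max(s, start), min(e, end))
        for c, s, e in intervals
        if c == chrom and s < end and e > start
    )
    covered = 0
    current_end = start
    for s, e in clipped:
        if e > current_end:
            covered += e - max(s, current_end)
            current_end = e
    return covered


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    region = ("chr1", 100, 200)
    annotations = [("chr1", 90, 120), ("chr1", 110, 130),
                   ("chr1", 180, 250), ("chr2", 100, 200)]
    n = covered_bases(region, annotations)
    print(n, n / (region[2] - region[1]))
```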

4
12.9 years ago

Wang et al. (and possibly others) show that for Illumina data "sequence coverage increases with GC-content increase when GC-content is less than 40–45%, but decreases when GC-content is more than 50–55%, with the peak at around 45%". Solid evidence for biases like these is hard to find in the literature (perhaps because vendors try to suppress this information, or because nailing such issues down is expensive), so I am not aware of the underlying cause having been worked out in detail. However, one further mechanism to add to your list is that GC-rich regions do not amplify well by bridge PCR.
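To look for the same trend in your own data, you could bin windows by GC content and compare mean depth per bin. The sketch below assumes you have already computed a (GC fraction, mean depth) pair for each window with whatever tools you normally use; the function name `mean_depth_by_gc_bin` and the example numbers are made up for illustration.

```python
from collections import defaultdict


def mean_depth_by_gc_bin(windows, bin_pct=10):
    """Average depth per GC bin.

    windows: iterable of (gc_fraction, mean_depth), gc_fraction in [0, 1].
    Returns {(bin_low_pct, bin_high_pct): mean depth} for non-empty bins.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    last_bin = (100 - 1) // bin_pct  # GC of exactly 100% goes in the top bin
    for gc, depth in windows:
        if not 0.0 <= gc <= 1.0:
            raise ValueError(f"GC fraction out of range: {gc}")
        idx = min(int(gc * 100 // bin_pct), last_bin)
        sums[idx] += depth
        counts[idx] += 1
    return {
        (i * bin_pct, min((i + 1) * bin_pct, 100)): sums[i] / counts[i]
        for i in sorted(sums)
    }


if __name__ == "__main__":
    # Invented values: two mid-GC windows and two high-GC windows.
    example = [(0.42, 30), (0.43, 34), (0.72, 10), (0.78, 6)]
    for gc_bin, depth in mean_depth_by_gc_bin(example).items():
        print(gc_bin, depth)
```

If depth drops off in the high-GC bins across all of your samples, that points to a platform or library-prep effect rather than something specific to one individual.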

2

The low coverage in the region Biomed found is certainly caused by the high GC content. Illumina can barely sequence regions with GC > 80%, even at relatively high overall coverage. Nonetheless, I do not think Illumina is hiding this. Even in Bentley et al. (2008), they presented a plot (done by Aylwyn from Richard's group) showing that sequence coverage is significantly lower for high GC. Yes, PCR is to blame.
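A quick way to flag candidate regions like this is to scan the reference in sliding windows and report those above a GC cutoff. This is a minimal sketch: the window size, step and the 0.70 threshold are arbitrary choices, and the function names are invented for the example.

```python
def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T).

    Returns None if the sequence has no unambiguous bases (e.g. all N).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def windowed_gc(seq, window=100, step=50):
    """Yield (start, end, gc_fraction) for sliding windows over seq."""
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        yield start, start + len(chunk), gc_fraction(chunk)


def high_gc_windows(seq, threshold=0.70, window=100, step=50):
    """Return the windows whose GC fraction is at or above the threshold."""
    return [
        (start, end, gc)
        for start, end, gc in windowed_gc(seq, window, step)
        if gc is not None and gc >= threshold
    ]


if __name__ == "__main__":
    toy = "GGGGAAAA"  # made-up sequence
    print(high_gc_windows(toy, threshold=0.70, window=4, step=4))
```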

0

Thanks for the comments. Are there additional factors to consider for regions that are not necessarily high GC but have low coverage for other reasons, whether from sequencing (PCR) or from bad alignment?

0

Bentley et al. (2008) do show how coverage decreases with extreme GC-content, but they downplay this result, saying it only affects "just 1% of unique chromosome sequence". What they don't say is that much of this 1% is found around CpG-island-containing promoters, and thus there are major biases for ChIP-seq using Illumina in mammals (as there also are for other platforms with GC-based coverage biases).

2
12.9 years ago

Just a guess: a hypervariable region (e.g. immunoglobulin)?

0

thanks for the input

1
12.9 years ago

Another reason, which at times can overlap with the possibilities offered by Brent and Casey, is how well the replication/sequencing machinery (i.e., the enzymes) can copy a given DNA strand as part of the sequencing reaction. Some DNA regions are poorly copied by PCR or reverse transcriptase (or are even toxic when transfected into bacteria). One explanation for this is the structure of the DNA, which could be due to GC-content, but not always. Other factors could come into play.

While the paralogous regions, repeats and segmental duplications on your list can be identified by computational approaches, not all structural variants can be identified so readily. As Casey wrote, the mechanisms are not known in sufficient detail for an algorithm to be written.

2

Just to add that this "toxic DNA" issue mainly applies to Sanger sequencing, where we mostly insert a ~3kb sequence into a plasmid. Protein fragments expressed from this 3kb sequence may kill the bacteria. I am not sure if this is still an issue for high-throughput sequencing.

1

I agree, but some of the "toxicity" was thought to arise from structural features of the insert or insert+vector combo. Thus I added it, but parenthetically.

