Difficult to Sequence Regions for WGS

9.0 years ago

gmcinnes • 0

I'm looking for BED files delimiting difficult-to-sequence regions of the genome, such as regions with high AT content, high GC content, homopolymer repeats, etc. Does anyone know if anything like this is publicly available? I have looked for tracks in the UCSC Table Browser, but I don't see any that fit this description.

Thanks

wgs next-gen-sequencing • 3.3k views

updated 2.3 years ago by Ram • written 9.0 years ago by gmcinnes

Ram · Answer 1 · 2015-11-16

I think you can try the UCSC Mappability track. The Blacklisted Regions subtrack may be what you want. The track descriptions from UCSC are quoted below.

Alignability

The CRG Alignability tracks display how uniquely k-mer sequences align to a region of the genome. To generate the data, the GEM-mappability program has been employed. The method is equivalent to mapping sliding windows of k-mers (where k has been set to 36, 40, 50, 75 or 100 nts to produce these tracks) back to the genome using the GEM mapper aligner (up to 2 mismatches were allowed in this case). For each window, a mappability score was computed (S = 1/(number of matches found in the genome): S=1 means one match in the genome, S=0.5 is two matches in the genome, and so on). The CRG Alignability tracks were generated independently of the ENCODE project, in the framework of the GEM (GEnome Multitool) project.
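To make the score concrete, here is a minimal sketch of S = 1/(number of matches) on a toy sequence. It is not the GEM-mappability program: it assumes exact matches only (the real tracks allowed up to 2 mismatches), looks at the forward strand only, and the function name `alignability_scores` and the toy sequence are made up for illustration.

```python
from collections import Counter


def alignability_scores(genome: str, k: int) -> list[float]:
    """Return one score per k-mer window: 1 / (exact occurrences in genome).

    Simplifying assumptions (not what GEM does): exact matching only,
    forward strand only, and the whole genome held in memory as a string.
    """
    genome = genome.upper()
    # Every sliding window of length k, in genomic order.
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    # How many times each distinct k-mer occurs across the genome.
    counts = Counter(windows)
    return [1.0 / counts[w] for w in windows]


# Toy example: "ACGT" occurs twice, every other 4-mer occurs once.
toy_genome = "ACGTACGTTTGA"
print(alignability_scores(toy_genome, 4))
```

Windows starting at positions 0 and 4 both read ACGT, so they score 0.5; all other windows are unique and score 1.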

Uniqueness

The Duke Uniqueness tracks display how unique each sequence is on the positive strand starting at a particular base and of a particular length. Thus, the 20 bp track reflects the uniqueness of all 20 base sequences with the score being assigned to the first base of the sequence. Scores are normalized to between 0 and 1, with 1 representing a completely unique sequence and 0 representing a sequence that occurs more than 4 times in the genome (excluding chrN_random and alternative haplotypes). A score of 0.5 indicates the sequence occurs exactly twice, likewise 0.33 for three times and 0.25 for four times. The Duke Uniqueness tracks were generated for the ENCODE project as tools in the development of the Open Chromatin: DNaseI HS, FAIRE, TFBS and Synthesis tracks.
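The difference from the alignability score is the cutoff: anything occurring more than four times is scored 0. A small sketch of that mapping, assuming the occurrence count is already known; the function name `uniqueness_score` is hypothetical and the rounding to two decimals only mirrors the values quoted above.

```python
def uniqueness_score(occurrences: int) -> float:
    """Map a genome-wide occurrence count to a Duke-style uniqueness score.

    1 occurrence -> 1.0, 2 -> 0.5, 3 -> 0.33, 4 -> 0.25, more than 4 -> 0.0.
    Rounding to two decimals is an assumption made to match the quoted values.
    """
    if occurrences < 1:
        raise ValueError("a sequence taken from the genome occurs at least once")
    if occurrences > 4:
        return 0.0
    return round(1.0 / occurrences, 2)


print([uniqueness_score(n) for n in range(1, 6)])
```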

Blacklisted Regions

The DAC Blacklisted Regions aim to identify a comprehensive set of regions in the human genome that have anomalous, unstructured, high signal/read counts in next gen sequencing experiments independent of cell line and type of experiment. There were 80 open chromatin tracks (DNase and FAIRE datasets) and 20 ChIP-seq input/control tracks spanning ~60 human tissue types/cell lines in total used to identify these regions with signal artifacts. These regions tend to have a very high ratio of multi-mapping to unique mapping reads and high variance in mappability. Some of these regions overlap pathological repeat elements such as satellite, centromeric and telomeric repeats. However, simple mappability based filters do not account for most of these regions. Hence, it is recommended to use this blacklist alongside mappability filters. The DAC Blacklisted Regions track was generated for the ENCODE project.
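Using the blacklist alongside a mappability filter amounts to taking the union of the two interval sets. A rough sketch of that union in plain Python, assuming both sets are already loaded as BED-style half-open (chrom, start, end) tuples; the function name and all coordinates below are invented for illustration, and for real files a dedicated tool such as bedtools is the more practical choice.

```python
from collections import defaultdict


def merge_intervals(*interval_sets):
    """Union several lists of (chrom, start, end) intervals.

    Overlapping or directly adjacent intervals on the same chromosome are
    collapsed into one. Output is sorted by chromosome name, then start.
    """
    by_chrom = defaultdict(list)
    for intervals in interval_sets:
        for chrom, start, end in intervals:
            by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is None:
                # First interval on this chromosome.
                current_start, current_end = start, end
            elif start > current_end:
                # Gap found: emit the finished interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
            else:
                # Overlap or adjacency: extend the current interval.
                current_end = max(current_end, end)
        merged.append((chrom, current_start, current_end))
    return merged


# Hypothetical coordinates, not taken from any real track.
blacklist = [("chr1", 100, 200), ("chr2", 50, 80)]
low_mappability = [("chr1", 150, 300), ("chr1", 400, 500)]
for region in merge_intervals(blacklist, low_mappability):
    print(*region, sep="\t")
```

The resulting list can be written out as a BED file and used to exclude regions from downstream analysis.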

The Duke Excluded Regions track displays genomic regions for which mapped sequence tags were filtered out before signal generation and peak calling for Open Chromatin: DNaseI HS and FAIRE tracks. This track contains problematic regions for short sequence tag signal detection (such as satellites and rRNA genes). The Duke Excluded Regions track was generated for the ENCODE project.