Decoy In Reference Assembly
11.5 years ago
Sangwoo Kim ▴ 440

I am using 1000 Genomes data for a new project. While inspecting the reference assembly the project uses, I found that it contains a "decoy" contig.

The 1000 Genomes FAQ says:

For the final round of alignments the sequence data will be mapped to a set of sequences derived from the GRCh37 assembly. This GRCh37-derived alignment set includes chromosomal plus unlocalized and unplaced contigs, the rCRS mitochondrial sequence (AC:NC_012920), Human herpesvirus 4 type 1 (AC:NC_007605) and decoy sequence derived from HuRef, Human Bac and Fosmid clones and NA12878. These files are available in phase2_reference_assembly_sequence on the ftp site. All human variant coordinates reported by the 1000 Genomes project are in GRCh37 coordinates.

Here, I have no idea what the decoy sequence is and why it is included. Maybe to detect sample contamination?

1000genomes • 19k views

BTW, for more information, check the 1000g ftp.

11.5 years ago

The reference genome is incomplete, particularly around the centromeres, so reads that truly belong elsewhere are often wrongly mapped to a particular place in the genome because their true match is missing from the reference. These mismapped reads cause false-positive calls, which were bothering us in the 1000 Genomes Project. The decoy is a pragmatic solution to this: it contains known true human genome sequence that is not in the reference genome, and it will "suck up" reads that would otherwise map with low quality to the reference. The decoy was built by Heng Li, at the Broad, working with Richard Durbin (Sanger) and Deanna Church of the Genome Reference Consortium (who maintain the reference genome).
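To make the "suck up" idea concrete, here is a toy sketch. It is not how BWA or any real aligner works: it places a read by brute-force ungapped mismatch counting, and the contig names (`chr_toy`, `decoy_toy`) and sequences are invented for illustration. It only shows that a read whose true origin is missing from the reference lands on a similar region with mismatches, and moves to the decoy once the decoy is added.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_hit(read, reference):
    """Return (contig, offset, mismatches) for the best ungapped placement.

    `reference` maps contig name -> sequence. Ties keep the first hit found.
    """
    best = None
    for name, seq in reference.items():
        for offset in range(len(seq) - len(read) + 1):
            d = mismatches(read, seq[offset:offset + len(read)])
            if best is None or d < best[2]:
                best = (name, offset, d)
    return best


# Hypothetical toy sequences (assumptions, not real genome data).
primary = {"chr_toy": "ACGTACGTTTGACCAGTAGGCATCA"}
decoy = {"decoy_toy": "GGGACGTACGATTGACGAGTCCC"}

# This read truly comes from decoy_toy, which resembles part of chr_toy.
read = "ACGTACGATTGACGAGT"

# Without the decoy, the read is forced onto chr_toy with mismatches,
# which downstream could look like variants.
without_decoy = best_hit(read, primary)

# With the decoy, the read finds its exact match and leaves chr_toy alone.
with_decoy = best_hit(read, {**primary, **decoy})

print("without decoy:", without_decoy)
print("with decoy:   ", with_decoy)
```

In the first case the mismatches against `chr_toy` are the toy analogue of the false-positive calls described above; in the second case they disappear because the read no longer maps there.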


Thank you for the great answer. So basically, hs37d5 is GRCh37 + decoy. GRCh37 also has many small contigs (e.g. GL000xx.1), and the alignment set includes a human herpesvirus (NC_007605) sequence. My understanding is that the decoy has the same goal, but it was built artificially?
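One way to check what a reference like this contains is to group the contig names listed in its FASTA index (`.fai`, tab-separated, with name and length in the first two columns). The sketch below assumes the hs37d5 naming scheme (`1`-`22`, `X`, `Y`, `MT`, `GL…` contigs, `NC_007605`, and a single `hs37d5` decoy contig); the helper names are hypothetical and the lengths in the example are placeholders, not real values.

```python
import re
from collections import defaultdict


def classify_contig(name):
    """Assign an hs37d5-style contig name to a broad category."""
    if re.fullmatch(r"[1-9]|1[0-9]|2[0-2]|X|Y", name):
        return "chromosome"
    if name == "MT":
        return "mitochondrion"
    if re.fullmatch(r"GL\d+\.\d+", name):
        return "unlocalized/unplaced"
    if name == "NC_007605":
        return "herpesvirus (EBV)"
    if name == "hs37d5":
        return "decoy"
    return "other"


def summarize_fai(lines):
    """Sum contig lengths per category from .fai-formatted lines."""
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        name, length = line.rstrip("\n").split("\t")[:2]
        totals[classify_contig(name)] += int(length)
    return dict(totals)


# Placeholder index lines: lengths are made up for illustration only.
example_fai = [
    "1\t1000\t52\t60\t61",
    "X\t800\t1100\t60\t61",
    "MT\t50\t2000\t60\t61",
    "GL000207.1\t40\t2100\t60\t61",
    "NC_007605\t30\t2200\t60\t61",
    "hs37d5\t500\t2300\t60\t61",
]

summary = summarize_fai(example_fai)
for category, total in summary.items():
    print(f"{category}: {total} bp")
```

With a real index, the same function can be fed the lines of the `.fai` file, for example `summarize_fai(open(path))` where `path` points at the index on disk.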


Yes, I guess the decoy has a similar goal to the herpesvirus sequence, except that the reads it removes come from true human sequence and need removing only because the reference is incomplete.


Thank you again!

18 months ago
kmzhou4 • 0

I have two questions. First, for the latest release from the NCBI web page, GCF_000001405.40_GRCh38.p14_genomic.fna (patch 14), is there a corresponding version of the decoy? Second, what is the positive impact of removing ALT loci on alignment (BWA) and on the subsequent analysis of both nucleotide (SNV) and structural (SV) variants? I can understand that, by masking the telomeric sequences (in the decoy), the aligner can be more specific (less often misled into a wrong alignment) and faster (by not exploring alignments that are not very useful for variant analysis).
