Question

About Accessible Genome Mask

8.1 years ago

SOHAIL ▴ 410

Dear all,

My question is a bit descriptive, and I need some clarification.

I read the 1000 Genomes Project paper entitled "A global reference for human genetic variation": http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html#supplementary-information

In section 9.2 of the supplementary information, "Callable genome mask", the authors provide two types of accessible genome mask: "Pilot" and "Strict". The stated reason for generating these regions was (quoted):

"Due to the nature of short-read sequencing, the sequencing depth varies along the length of the genome. As such, not all regions of the genome will have equal power for variant discovery. To provide an assessment of the regions of the genome that are accessible to the next-generation sequencing methods used in Phase 3, we created two genome masks".

The most recent versions of the BED files are provided here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/

My purpose: I want to use these genome masks as hard filters in my variant filtering step, so that population genetic analyses (such as estimates of mutation rate), which must focus on genomic regions with very low false positive and false negative rates, are performed correctly. In short, I want to obtain high-confidence variant sites.

Problem: I am a bit confused about whether to keep the variants present in these regions (i.e. the accessible mask regions) or to discard them. I had 9556898 bi-allelic SNPs in total. I used the "20141020.strict_mask.whole_genome.bed" file, filtered out the 6611479 variants present in these regions, and was left with 2945419 variants. In this way more than 50% of the variants were lost.

I am unsure whether the SNPs present in these regions should be kept or filtered out; maybe I am misreading the "mask" keyword in the file names.

Can anyone explain how to use these files?

Thank you very much in advance for your patience and help!

ngs 1000 genomes • 4.4k views

updated 7.5 years ago by J.Rodrigo Flores ▴ 50 • written 8.1 years ago by SOHAIL ▴ 410

score 3 · Answer 1 · 2017-05-28

Hi,

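The coordinates in both the pilot and strict mask BED files represent the bases of the genome that, according to the criteria of the 1000 Genomes Project, are fully 'callable' and are therefore marked as 'P' bases ("passed all filters"; see links below). In other words, the BED files list the accessible regions, so for high-confidence sites the variants inside these regions are the ones to keep rather than the ones to filter out.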
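If the BED files list the callable regions, as described above, then restricting a SNP set to high-confidence sites means keeping the positions that fall inside the intervals. Here is a minimal Python sketch of that check on toy data. The function names `load_bed` and `in_mask` are hypothetical, and the sketch assumes the intervals do not overlap, that BED coordinates are 0-based half-open while variant positions are 1-based, and that chromosome names are spelled the same way in both files (worth checking before running it on real data).

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into {chrom: (sorted starts, matching ends)}."""
    by_chrom = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        by_chrom.setdefault(chrom, []).append((int(start), int(end)))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _ in intervals], [e for _, e in intervals])
    return index


def in_mask(index, chrom, pos):
    """True if the 1-based position lies inside a 0-based half-open BED interval."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    zero_based = pos - 1
    # Last interval whose start is <= the position (assumes no overlaps).
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


# Toy mask and toy SNP positions (made up for illustration only).
toy_bed = ["chr1\t10\t20\tstrict", "chr1\t30\t40\tstrict"]
toy_snps = [("chr1", 10), ("chr1", 11), ("chr1", 20),
            ("chr1", 21), ("chr1", 35), ("chr2", 5)]

index = load_bed(toy_bed)
kept = [snp for snp in toy_snps if in_mask(index, *snp)]
print(kept)
```

For a whole-genome SNP set, an interval-aware tool would likely be faster; the sketch only illustrates the direction of the filter (keep what is inside the mask).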

A simple test validates this. Take one of the BED files, say the strict one, and sum the lengths of all its regions. Then take the mask FASTA for the genome, in which each base is encoded with one of the letters N, L, H, Z, Q, P or 0 according to its callable status and the rules the project set for these masks (see links below), and count the total number of 'P' letters (bases that passed all filters). The two numbers should match. I did this for one chromosome and it holds up.
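The consistency check described above can be sketched in Python as follows. This is an illustration on toy data, not the exact commands I ran. The function names are hypothetical, both functions accept any iterable of text lines (for a gzipped file, for example, the object returned by `gzip.open(path, "rt")`), and the sketch assumes the FASTA header name matches the chromosome name used in the BED file.

```python
def bed_total_length(bed_lines, chrom):
    """Sum (end - start) over all BED intervals on one chromosome."""
    total = 0
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] == chrom:
            total += int(fields[2]) - int(fields[1])
    return total


def count_passed_bases(fasta_lines, chrom):
    """Count 'P' characters in the mask FASTA record named chrom."""
    count = 0
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else None
        elif current == chrom:
            count += line.count("P")
    return count


# Toy mask FASTA and the BED intervals that correspond to its 'P' runs.
toy_fasta = [">chr1", "NNPPPLL", "PPHZQ00", ">chr2", "PPPP"]
toy_bed = ["chr1\t2\t5\tstrict", "chr1\t7\t9\tstrict", "chr2\t0\t4\tstrict"]

bed_len = bed_total_length(toy_bed, "chr1")
p_count = count_passed_bases(toy_fasta, "chr1")
print(bed_len, p_count)
```

If the two numbers agree on a real chromosome, the BED intervals are the 'P' bases, which confirms that the BED file describes the regions to keep.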

Links:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/StrictMask/

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/README.accessible_genome_mask.20140520

J. Rodrigo Flores