How can I mask a sequence with SNPs depending on MAF? The sequence I am interested in is human build 37 and I'd like to mask SNPs that have frequencies of >1% or >5% in dbSNP. Is there some resource out there with common SNPs already masked?
Get a BED file of the SNPs you want to discard (for example from the UCSC Table Browser: group: Variation, track: All_Snp138, then filter -> create and set a threshold on avHet),
then use maskfasta to mask the reference:
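For illustration, here is a minimal Python sketch of what the masking step does: every base covered by a BED interval is replaced with `N`. It is not the implementation of any existing tool, and the function names (`read_bed`, `parse_fasta`, `mask_fasta`, `maf_to_avhet`) are hypothetical. It assumes standard BED coordinates (0-based, end-exclusive) and FASTA headers whose first word matches the BED chromosome name. The `maf_to_avhet` helper additionally assumes that avHet is the average heterozygosity of a biallelic SNP, 2p(1-p), so a MAF cutoff of 1% would correspond to avHet of about 0.0198 and 5% to about 0.095; treat that mapping as an approximation, not a documented rule.

```python
def maf_to_avhet(maf):
    """Expected heterozygosity 2p(1-p) for a biallelic site (assumption, see text)."""
    return 2.0 * maf * (1.0 - maf)


def read_bed(handle):
    """Collect (start, end) intervals per chromosome from a BED-like file handle."""
    intervals = {}
    for line in handle:
        fields = line.split()
        # Skip blank lines, comments and UCSC track/browser header lines.
        if len(fields) < 3 or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        intervals.setdefault(fields[0], []).append((int(fields[1]), int(fields[2])))
    return intervals


def parse_fasta(handle):
    """Yield (name, sequence) pairs; name is the first word of the header line."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def mask_fasta(fasta_handle, intervals, out_handle, mask_char="N", width=60):
    """Write a hard-masked copy of the FASTA, replacing BED-covered bases with mask_char."""
    for name, seq in parse_fasta(fasta_handle):
        masked = list(seq)
        for start, end in intervals.get(name, []):
            # Clip intervals to the sequence length so bad coordinates cannot extend it.
            start, end = max(0, start), min(end, len(masked))
            for i in range(start, end):
                masked[i] = mask_char
        out_handle.write(">" + name + "\n")
        for i in range(0, len(masked), width):
            out_handle.write("".join(masked[i:i + width]) + "\n")
```

To use it, open the reference FASTA, the SNP BED file and an output file, then call `mask_fasta(fasta, read_bed(bed), out)`. For a whole human genome a dedicated tool will be much faster; the sketch only makes the logic explicit.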
I have never used this tool but it seems useful for what you want to achieve.
From its documentation: this command returns the sequences of the genomic regions given in the region file, in FASTA format (to stdout or to an output file). The region file is a tab-delimited file with at least the following columns: chromosome, begin, end. RepeatMasker repeats are soft-masked (lower case) in the output sequences. Optionally, you can hard-mask repeats, and soft- or hard-mask known (dbSNP) variants based on frequency.

There is also an already masked reference FASTA based on dbSNP.
Thanks, Ashutosh. I'll take a look at this! I was originally using the fastas from but it included too many SNPs. I only want to mask the high-frequency SNPs and preferably only the SNPs which are high-frequency in Asian populations.
Thanks, Pierre. That seems like it would work but I can't find any documentation on what the "avHet" filter is. Is that average heterozygosity?