How can I mask a sequence with SNPs depending on MAF? The sequence I am interested in is human build 37 and I'd like to mask SNPs that have frequencies of >1% or >5% in dbSNP. Is there some resource out there with common SNPs already masked?
Get a BED file of the SNPs you want to discard from the UCSC Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables?command=start (group: variation, track: All_Snp138, then filter -> create -> avHet).
Then use bedtools maskfasta to mask the reference with that BED file: http://bedtools.readthedocs.org/en/latest/content/tools/maskfasta.html
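To make the masking step concrete, here is a minimal Python sketch of the idea: replace every base covered by a BED interval with N (or lower-case it for soft masking). This is not bedtools itself, just an illustration. The function names `parse_bed` and `mask_sequence`, the toy sequence, and the `rs_example_*` IDs are made up for the example, and it assumes BED coordinates are 0-based and half-open.

```python
def parse_bed(lines):
    """Collect BED intervals (assumed 0-based, half-open) per chromosome."""
    intervals = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        intervals.setdefault(chrom, []).append((int(start), int(end)))
    return intervals


def mask_sequence(seq, intervals, soft=False):
    """Return seq with each interval set to N, or lower-cased if soft=True."""
    bases = list(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range intervals are ignored.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


# Toy data: two hypothetical SNP intervals on a 10-base sequence.
bed = ["chr1\t2\t3\trs_example_1", "chr1\t5\t7\trs_example_2"]
snps = parse_bed(bed)
masked = mask_sequence("ACGTACGTAC", snps["chr1"])
print(masked)  # ACNTANNTAC
```

For a whole genome you would still want bedtools maskfasta, since this sketch holds each sequence in memory as a Python list.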
I have never used this tool but it seems useful for what you want to achieve.
http://genomecomb.sourceforge.net/docs/cg_genome_seq.html
(From the documentation: this command returns the sequences of the genomic regions listed in the region file, in FASTA format, either to stdout or to an output file. The region file is a tab-delimited file with at least the following columns: chromosome, begin, end. RepeatMasker repeats are soft-masked (lower case) in the output sequences. Optionally you can hard-mask repeats, and soft- or hard-mask known (dbSNP) variants based on frequency.)
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/snp138Mask/ (Already masked reference fasta based on dbSNP)
Thanks, Ashutosh. I'll take a look at this! I was originally using the FASTA files from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/snp138Mask/ but they mask too many SNPs. I only want to mask the high-frequency SNPs, and preferably only the SNPs that are high-frequency in Asian populations.
Thanks, Pierre. That seems like it would work but I can't find any documentation on what the "avHet" filter is. Is that average heterozygosity?
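If avHet is the average heterozygosity (that is my reading of the field name, so please check it against the UCSC table schema), a MAF cutoff can be translated into a rough heterozygosity cutoff. Assuming a biallelic SNP in Hardy-Weinberg equilibrium, the expected heterozygosity is 2p(1-p), where p is the minor allele frequency. The helper name `maf_to_heterozygosity` below is made up for this sketch.

```python
def maf_to_heterozygosity(maf):
    """Expected heterozygosity 2p(1-p) for a biallelic SNP under Hardy-Weinberg."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must be between 0 and 0.5")
    return 2.0 * maf * (1.0 - maf)


# Cutoffs from the question: MAF > 1% and MAF > 5%.
for maf in (0.01, 0.05):
    print(f"MAF {maf:.2f} -> expected heterozygosity {maf_to_heterozygosity(maf):.4f}")
# MAF 0.01 -> expected heterozygosity 0.0198
# MAF 0.05 -> expected heterozygosity 0.0950
```

So under those assumptions, filtering on avHet above roughly 0.02 or 0.095 would approximate the 1% and 5% MAF cutoffs. Keep in mind the value is averaged over all observations, so it is not specific to any one population.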