Identifying which SNPs sit in TFBSs (yeast)
9.0 years ago
grins38 ▴ 10

I have a set of ~11k SNPs for Saccharomyces cerevisiae (baker's yeast), and I would like to identify which of them sit in transcription factor binding sites (TFBSs) and, for those that do, get information on the relevant TFBS.

I've scoured this site and the internet and couldn't find a downloadable database that gives the locations of all known/verified TFBSs for yeast. I've studied the YEASTRACT website thoroughly but didn't find a way to download TFBS information for all the ORFs/genes in one go.

Moreover, when I tried getting the information manually through the YEASTRACT online search, I found the results confusing. For example, searching for TFs for ORF "YOL166W-A" returns a list of 4 TFs. Clicking on one of them, say Sok2p, takes you to another page which says, amongst other things, that the corresponding TFBS is "acMTGCAKg". What does that mean? (I know the ACTG alphabet, but what are the 'a', 'c', 'K' and 'g' symbols?) And what does it tell me about the actual location of the binding site? Do I have to BLAST the whole genome to identify it? (Shouldn't there already be a database with this info for yeast?) If so, how do I BLAST for symbols like 'K' and 'g'?

I have a background in stats but currently work on genetics applications, so extracting the relevant bioinformatics data can be very confusing for me. Any help will be appreciated.

binding-site TFBS • 2.3k views
9.0 years ago

I haven't done yeast before, but if you can get MEME-formatted position weight matrices (PWMs) for your transcription factors of interest, from YEASTRACT for instance, you should be able to use them with a site prediction tool like FIMO. FIMO is part of the MEME toolkit and calls binding sites across your specified (FASTA-formatted) genome at or below a p-value threshold you specify.

SNPs can be in a format called VCF, and FIMO output can be in GFF format. In those cases, you can use vcf2bed and gff2bed in the BEDOPS toolkit to write SNP and FIMO results to sorted BED files. For example:

$ vcf2bed < SNPs.vcf > SNPs.bed
$ gff2bed < TFBSs.gff > TFBSs.bed
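
For intuition about what the conversion step does to coordinates, here is a minimal sketch using only awk and sort. It is not a replacement for vcf2bed (it keeps only the ID column and ignores indel lengths); the file names and the two SNPs are made up for illustration. VCF positions are 1-based, while BED intervals are 0-based and end-exclusive, so a SNP at POS becomes the interval [POS-1, POS):

```shell
# Toy VCF with two hypothetical SNPs (tab-separated; POS is 1-based).
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > toy.vcf
printf 'chrI\t250\tsnp2\tC\tT\nchrI\t100\tsnp1\tA\tG\n' >> toy.vcf

# Skip header lines, then emit BED columns: chrom, POS-1, POS, ID.
# Sort by chromosome, then numerically by start, so the result is a sorted BED file.
awk 'BEGIN { OFS = "\t" } !/^#/ { print $1, $2 - 1, $2, $3 }' toy.vcf \
  | sort -k1,1 -k2,2n > toy_SNPs.bed

cat toy_SNPs.bed
```

With real data, vcf2bed and gff2bed handle the extra columns and the sorting for you.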

Once you have TFBSs in a sorted BED file and your SNPs in a sorted BED file (however you do this, whether via BEDOPS tools or anything else), you can use set operation tools like BEDOPS bedmap with the --echo and --echo-map operators to print SNPs that overlap ("map to") TFBSs, e.g.:

$ bedmap --echo --echo-map --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_overlapping_SNPs.bed

Each line of output is a TF binding site, followed by any SNPs that overlap that binding site by one or more bases.
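
To make "overlap by one or more bases" concrete, here is a naive awk sketch of the same idea on toy data. It is only an illustration under my own assumptions (hypothetical file names and intervals, SNP IDs joined with ';'), and it compares every TFBS against every SNP, which bedmap avoids by working on sorted input. Two half-open BED intervals on the same chromosome overlap when each one starts before the other ends:

```shell
# Hypothetical toy inputs (BED: 0-based start, end-exclusive).
printf 'chrI\t90\t99\tSok2p_site\nchrI\t240\t252\tOther_site\n' > toy_TFBSs.bed
printf 'chrI\t99\t100\tsnp1\nchrI\t249\t250\tsnp2\n' > toy_SNPs.bed

# First file (SNPs) is loaded into arrays; for each TFBS in the second file,
# append the IDs of all SNPs that share at least one base with it.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { chr[NR] = $1; s[NR] = $2; e[NR] = $3; id[NR] = $4; n = NR; next }
     {
       hits = ""
       for (i = 1; i <= n; i++)
         if (chr[i] == $1 && s[i] < $3 && e[i] > $2)
           hits = (hits == "" ? id[i] : hits ";" id[i])
       print $0, hits
     }' toy_SNPs.bed toy_TFBSs.bed > toy_overlaps.tsv

cat toy_overlaps.tsv
```

Note that snp1 starts exactly where Sok2p_site ends, so under half-open coordinates the two are adjacent and do not overlap; only Other_site picks up a SNP.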

You can change the operators to get different information. If you just want a list of unique SNP IDs, for example:

$ bedmap --echo --echo-map-id-uniq --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_overlapping_SNP_IDs.bed

If you want to customize the overlap threshold between a TF binding site and a SNP, you can add overlap parameters. For instance, to ensure a SNP falls entirely within a TFBS, you can add --fraction-map 1:

$ bedmap --echo --echo-map --fraction-map 1 --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_entirely_contained_SNPs.bed

Some binding sites may not overlap any SNPs, and you might not be interested in those. You could add --skip-unmapped to print only the binding sites with SNP overlaps:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --skip-unmapped TFBSs.bed SNPs.bed > Only_TFBSs_with_overlapping_SNP_IDs.bed

Etc.
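
On the "acMTGCAKg" string from the question: K and M are standard IUPAC ambiguity codes (K = G or T, M = A or C). I'm assuming here that the lower-case letters are ordinary bases written in lower case, which is worth checking against the YEASTRACT documentation. Under that assumption, a consensus like this can be turned into a regular expression and searched for directly, without BLAST. A rough sketch on a made-up sequence, forward strand only (a real scan would also need the reverse complement and a FASTA reader):

```shell
consensus='acMTGCAKg'

# Upper-case everything, then expand the two ambiguity codes present in this motif.
regex=$(printf '%s' "$consensus" | tr 'a-z' 'A-Z' | sed -e 's/M/[AC]/g' -e 's/K/[GT]/g')
echo "$regex"

# Hypothetical toy sequence; report the 0-based start and the text of the first match.
hit=$(printf 'TTACATGCATGTT\n' \
  | awk -v re="$regex" '{ if (match($0, re)) print RSTART - 1, substr($0, RSTART, RLENGTH) }')
echo "$hit"
```

A PWM scan with FIMO is generally the better route, since it scores near-matches instead of requiring an exact consensus hit.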


Thank you for this! I shall try it in the coming week.
