Question

Gene search with known start and end site

9.1 years ago

tanni93 ▴ 50

I have a large file of open chromatin regions with their start and end coordinates. I want to use this file (it can be converted to BED, BAM, FASTA, etc.) to find known genes near each region, within 500 kb of the start and end sites. What software can I use to find these genes? Thank you!

ChIP-Seq • 2.1k views

score 1 · Answer 1 · 2015-10-12

9.1 years ago

Ram 44k

If you have the chromosomal co-ordinates, you can use UCSC Genome Browser's MySQL schema to design appropriate SQL queries.
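As a rough sketch of what such a query could look like, the snippet below builds a SQL statement for the genes overlapping a site padded by 500 kb on each side (the window includes the site itself). The helper name build_gene_query and the example coordinates are made up for illustration. The refGene table and its name2, chrom, txStart and txEnd columns come from the UCSC schema, but which assembly database to query (hg38 is assumed in the usage note) depends on your data.

```shell
# build_gene_query is a hypothetical helper name; refGene and its columns
# (name2, chrom, txStart, txEnd) are taken from the UCSC schema.
build_gene_query() {
    chrom=$1; start=$2; end=$3; pad=${4:-500000}
    lo=$(( start - pad ))
    if [ "$lo" -lt 0 ]; then lo=0; fi   # clamp at the chromosome start
    hi=$(( end + pad ))
    # A gene overlaps the window [lo, hi) when txStart < hi and txEnd > lo
    printf "SELECT DISTINCT name2, chrom, txStart, txEnd FROM refGene WHERE chrom = '%s' AND txEnd > %d AND txStart < %d;\n" "$chrom" "$lo" "$hi"
}

# Example site (made-up coordinates)
build_gene_query chr1 1200000 1200500
```

With network access and a mysql client, the query could then be sent to UCSC's public server, for example: mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -e "$(build_gene_query chr1 1200000 1200500)" hg38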

score 1 · Answer 2 · 2015-10-12

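You could accomplish this with Bedtools. First, create a BED file of the regions that flank your open chromatin sites. The -g option takes a genome file, a tab-delimited list of chromosome names and sizes, so that flanks are not extended past the chromosome ends: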

bedtools flank -i chromatin_sites.bed -g my.genome -b 500000 > flanking.bed
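To make the flank arithmetic concrete, here is a small awk sketch on toy data that mimics what the command above computes: one region upstream and one downstream of each site, clipped to the chromosome bounds. It is only an illustration, not a replacement for bedtools. The file names and coordinates are invented, and skipping zero-length flanks is a choice made in this sketch.

```shell
# Toy inputs (hypothetical names and coordinates, for illustration only)
printf 'chr1\t1000000\n' > toy.genome
printf 'chr1\t200000\t200500\tsiteA\n' > toy_sites.bed

# Emulate the arithmetic of a 500 kb flank on both sides of each site
awk -v OFS='\t' -v pad=500000 '
    NR == FNR { size[$1] = $2; next }          # first file: chromosome sizes
    {
        left = $2 - pad;  if (left < 0) left = 0
        right = $3 + pad; if (right > size[$1]) right = size[$1]
        if (left < $2)  print $1, left, $2, $4   # upstream flank
        if ($3 < right) print $1, $3, right, $4  # downstream flank
    }
' toy.genome toy_sites.bed > toy_flanks.bed

cat toy_flanks.bed
```

Here the upstream flank is clipped to start at 0 because the site sits less than 500 kb from the start of the chromosome.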

Next, get a BED file of genes for your particular genome. Go to the UCSC Table Browser, specify your genome/assembly, choose the RefSeq Genes track and set the output format to BED. Click "get output" and then choose to create one BED record per whole gene. This will give you a BED file of gene coordinates.

Finally, intersect the flanking regions with your genes:

bedtools intersect -a flanking.bed -b genes.bed -wa -wb > flanking_genes.bed

The result is a tab-separated file in which each line pairs one of your flanking regions with one gene that overlaps it.
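If you would rather have one line per site, the intersect output can be collapsed afterwards. The sketch below runs on a toy stand-in for flanking_genes.bed and assumes the flanking records have 4 columns with the site name in column 4, so the gene name lands in column 8. Adjust the column numbers to your own files. The site and gene names are placeholders.

```shell
# Toy stand-in for flanking_genes.bed: 4 flank columns followed by 6 gene columns
{
    printf 'chr1\t0\t200000\tsiteA\tchr1\t150000\t160000\tgeneX\t0\t+\n'
    printf 'chr1\t200500\t700500\tsiteA\tchr1\t300000\t310000\tgeneY\t0\t-\n'
    printf 'chr1\t250000\t750000\tsiteB\tchr1\t300000\t310000\tgeneY\t0\t-\n'
} > toy_flanking_genes.bed

# Keep site name and gene name, drop duplicate pairs, then join genes per site
cut -f4,8 toy_flanking_genes.bed | sort -u \
    | awk -F'\t' -v OFS='\t' '
        $1 != prev { if (NR > 1) print prev, list; prev = $1; list = $2; next }
        { list = list "," $2 }
        END { if (NR > 0) print prev, list }
    ' > genes_per_site.tsv

cat genes_per_site.tsv
```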

Ram · Answer 3 · 2015-10-12

Here's a way to use the BEDOPS toolkit to map Gencode human genes to the 500 kb regions flanking your sites.

You could use other annotations, depending on your experiment and what kinds of gene annotations you are interested in.

First, grab the Gencode annotations, convert to BED, and filter them for gene records:

$ wget -O - ftp://ftp.sanger.ac.uk//pub/gencode/Gencode_human/release_18/gencode.v18.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v18.genes.bed
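The grep -w gene step matches the word "gene" anywhere on the line. A stricter variant, sketched here on toy records, tests only the feature-type column. This assumes the gtf2bed layout puts the GTF feature type in column 8, which is worth checking against your own converted file. The IDs and coordinates below are placeholders.

```shell
# Toy records shaped like gtf2bed output (placeholder IDs and coordinates)
{
    printf 'chr1\t1000\t5000\tGENE_A\t.\t+\tHAVANA\tgene\t.\tgene_id "GENE_A";\n'
    printf 'chr1\t1000\t3000\tGENE_A\t.\t+\tHAVANA\ttranscript\t.\tgene_id "GENE_A";\n'
} > toy_annotation.bed

# Keep only records whose feature type (assumed to be column 8) is exactly "gene"
awk -F'\t' '$8 == "gene"' toy_annotation.bed > toy_genes.bed

cat toy_genes.bed
```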

Then sort the sites and use bedops --range to pad them by 500 kb on each side. Take the --difference of the padded sites and the original sites to keep just the flanking regions, and use bedmap --echo-map-id-uniq to list the unique IDs of the Gencode genes that overlap those regions by one or more bases:

$ sort-bed sites.bed > sites.sorted.bed
$ bedops --range 500000 --everything sites.sorted.bed \
    | bedops --difference - sites.sorted.bed \
    | bedmap --echo --echo-map-id-uniq - gencode.v18.genes.bed \
    > answer.bed

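The file answer.bed will contain each flanking region, followed by a "|" delimiter and a semicolon-separated list of the IDs of all overlapping genes. Regions with no overlapping genes have an empty list.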
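If you only need a flat, de-duplicated list of gene IDs across all regions, the bedmap output can be post-processed with standard tools. This is a sketch on a toy stand-in for answer.bed, relying on bedmap's default "|" and ";" delimiters. The IDs are placeholders.

```shell
# Toy stand-in for answer.bed (placeholder IDs; third region has no overlaps)
{
    printf 'chr1\t0\t200000|GENE_A;GENE_B\n'
    printf 'chr1\t200500\t700500|GENE_B\n'
    printf 'chr2\t100\t500100|\n'
} > toy_answer.bed

# Take the ID list after "|", split on ";", drop empties, de-duplicate
cut -d'|' -f2 toy_answer.bed \
    | tr ';' '\n' \
    | grep -v '^$' \
    | sort -u > unique_gene_ids.txt

cat unique_gene_ids.txt
```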