Gene search with known start and end site
3
0
Entering edit mode
9.1 years ago
tanni93 ▴ 50

I have a large file with open chromatin start and end sites. I want to run this file (can be in bed/bam/fastA/etc format after conversion) to find known genes around the start and end sites (500 kB from the start and end sites). What software may I use to find known genes around the start and end sites? Thank you!

ChIP-Seq • 2.1k views
ADD COMMENT
1
Entering edit mode
9.1 years ago
Ram 44k

If you have the chromosomal co-ordinates, you can use UCSC Genome Browser's MySQL schema to design appropriate SQL queries.

ADD COMMENT
1
Entering edit mode
9.1 years ago
James Ashmore ★ 3.5k

You could accomplish this with Bedtools. First you want to create a BED file of regions which flank your open chromatin sites:

bedtools flank -i chromatin_sites.bed -g my.genome -b 500000 > flanking.bed

Next, get a BED file of genes for your particular genome. Go to the UCSC table browser, specify your genome/assembly, choose track RefSeqGenes and output format BED. Click get output and then create one BED record per whole gene. This will create a BED file of gene coordinates.

Finally, intersect the flanking regions with your genes:

bedtools intersect -a flanking.bed -b genes.bed -wa -wb > flanking_genes.bed

What is produced will be a tab-separated value file where each line lists an overlap between your flanking sites and any intersecting genes.

ADD COMMENT
1
Entering edit mode
9.1 years ago

Here's a way to use the BEDOPS toolkit to map Gencode human genes to the 500 kB regions outside your sites.

You could use other annotations, depending on your experiment and what kinds of gene annotations you are interested in.

First, grab the Gencode annotations, convert to BED, and filter them for gene records:

$ wget -O - ftp://ftp.sanger.ac.uk//pub/gencode/Gencode_human/release_18/gencode.v18.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v18.genes.bed

Then use bedops --range to pad the sites in a sorted version of sites.bed, take the --difference of the padded sites and the original sites to get just the 500 kB padded regions, and then use bedmap --echo-map-id-uniq to build a list of unique gene names, for Gencode genes that overlap the padded regions by one or more bases:

$ sort-bed sites.bed \
    | bedops --range 500000 --everything - \
    | bedops --difference - sites.bed \
    | bedmap --echo --echo-map-id-uniq - gencode.v18.genes.bed \
    > answer.bed

The file answer.bed will contain each 500 kB window and all gene IDs for overlapping genes.

ADD COMMENT
0
Entering edit mode

Thank you! I am relatively new to bioinformatics - where exactly are you typing the above commands? Python/R?

ADD REPLY
2
Entering edit mode

These are on the UNIX shell. If people were including the prompt in the code, you'd usually see the following symbols for each language:

Python: >>>

>>> print("Hello!")

R: >

> x<-read.table(...)

Unix: $

$ echo -e "Hello\tWorld!"
ADD REPLY

Login before adding your answer.

Traffic: 2689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6