Chipseq peak: nearby binding gene
2
1
Entering edit mode
8.7 years ago
Mike ★ 1.9k

Hi All, How can I find other gene present nearby to a specific gene (TFEB) in chipseq peak file (bed format). Please suggest any command line tool/script, instead of uploading on genome browser.

Thanks, Mike

ChIP-Seq • 5.4k views
ADD COMMENT
4
Entering edit mode
8.7 years ago
Sinji ★ 3.2k

You have a couple of options here. You're essentially trying to annotate your peaks.

1) HOMER has the ability to annotate your peaks to the nearest gene, as well as provides a wide variety of other useful tools. It is a command line tool. The tool accepts HOMER peak files or BED files (specific formatting information can be found on the link). As a generic example the tool would be used in the following way:

annotatePeaks.pl $NARROWPEAK_FILE $REFERENCE_GENOME -gtf $REFERENCE_GTF > $OUTPUT

Where $REFERENCE_GENOME could be one of: hg38, hg19, mm9, etc.

Where $REFERENCE_GTF is a gtf formatted dataset from Gencode, or UCSC.

2) ChIPseeker is a R package available from the Bioconductor 'suite' that can be used for annotation, comparison, and visualization of ChIP-seq data. Your question can be answered via section 5 here. But essentially you'd be using the annotatePeak function.

3) Bedtools is a software suite for genomics analysis tasks. Your question could be answered by using the closest function which searches for overlapping features in file A and file B. If no feature in B overlaps A then it will report the nearest feature in B. This means you'd need access to a BED file that has all your gene coordinates, this can easily be grabbed from the UCSC Table Browser. A generic example would look something like this:

bedtools closest -a $NARROWPEAK_FILE -b $GENE_BED_FILE [any options you might want] > $OUTPUT_FILE

I'm sure there's a variety of other annotation software out there, these are just my commonly used ones, but i'm hoping that others are shared.

ADD COMMENT
0
Entering edit mode

Thank you Sinji, for excellent explanation.. but still bit confusion in bedtools command , because I have only single input file NARROWPEAK_FILE , how can I generate second file GENE_BED_FILE .

Thanks, mike

ADD REPLY
0
Entering edit mode

After reading the replies to @Alex_Reynolds i'm going to advise you to wait for his answer. The answers I gave will annotate all your peaks to their nearest gene, but isn't meant to solve the question you seem to be posing.

ADD REPLY
1
Entering edit mode
8.7 years ago

You can use BEDOPS closest-features and BEDOPS gtf2bed or gff2bed with a GTF or GFF file containing gene annotations.

For example, to get annotations for hg19:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v19.genes.bed

Then run a query:

$ closest-features --closest --delim '\t' --no-ref peaks.bed gencode.v19.genes.bed > answer.bed

The --no-ref option leaves out overlaps between peaks, so that what is reported as nearest is a gene annotation. The --closest option picks the nearest of the two annotations that are upstream and downstream of the peak (depending on strandedness, you may want to leave this out, post-processing with awk or similar to choose based on strand; or add --dist and select based on signed distance).

See --help for other useful options.

ADD COMMENT
0
Entering edit mode

Thank you Alex.. Now I have answer.bed file after following your above steps, but how can I find which genes are nearby to "TFEB" gene (my interested gene) in answer.bed file. Thanks a lot, mike

ADD REPLY
0
Entering edit mode

The first few columns will contain the peak, the remaining columns will contain the nearest gene. Remove the --delim '\t' option if you want a clearer delimiter between peak and nearest gene. It is sometimes more convenient to have a tab, however, but the pipe character should make the connection clearer.

Perhaps I am misunderstanding your question, however. Are you instead looking for:

  1. The subset of peaks that overlaps TFEB gene regions (for which you already have information about where they are located)
  2. Nearest genes to that subset of peaks

Please let me know.

ADD REPLY
0
Entering edit mode

Thanks, Alex, I am looking genes which are nearest to the TFEB peak. THanks,

ADD REPLY
0
Entering edit mode

Do you have the TFEB peaks in their own file?

ADD REPLY
0
Entering edit mode

Yes, I have one peak for TFEB in file. Thanks

ADD REPLY
0
Entering edit mode

Is it annotated? In other words, can you pick this out from all the peaks?

ADD REPLY
0
Entering edit mode

yes,

grep TFEB    answer.bed    (answer.bed is my annotated file)

chr6    41651715    41703997    ENSG00000112561.13  .   -   HAVANA  gene    .   gene_id ENSG00000112561.13"; transcript_id "ENSG00000112561.13"; gene_type "protein_coding"; gene_status KNOWN"; gene_name "TFEB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name TFEB"; level 2; havana_gene "OTTHUMG00000014684.12";

Thanks alot

ADD REPLY
0
Entering edit mode

The following will tell you which is the closest gene to this annotation. It works by piping the TFEB-specific annotation to closest-features, to find the nearest Gencode annotation:

$ grep TFEB answer.bed | closest-features --closest --delim '\t' - gencode.v19.genes.bed > TFEB.closest-feature.bed

If you need the N nearest genes, not just the first nearest gene, I describe a solution here that uses a repeated series of piped calls to closest-features.

ADD REPLY
0
Entering edit mode

Hi Alex, Sorry for again.. when i used following command:

$ grep TFEB answer.bed | closest-features --closest --delim '\t' - gencode.v19.genes.bed > TFEB.closest-feature.bed

it returned only TFEB containing single line (chr6 41651715 41703997 ENSG00000112561.13 . - HAVANA ................... TFEB.....)

next, I used closest-features for N nearest gene....

closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - >1.bed 
 closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - |closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - >2.bed
closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - |closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - | closest-features gencode.v19.genes.bed TFEB.closest-feature.bed |awk -F'|' '{print $3}' - >3.bed

paste -d'|' 1.bed 2.bed 3.bed >Results.bed



wc 1.bed 2.bed 3.bed  Results.bed gencode.v19.genes.bed TFEB.closest-feature.bed
   57820  122572  906180 1.bed
   57820  122572  906180 2.bed
   57820  122572  906180 3.bed
   57820  252076 2718540 Results.bed
   57820 1680206 18585614 gencode.v19.genes.bed
       1      58     648 TFEB.closest-feature.bed
4299  125105 1390879 answer.bed

but in all files (1.bed, 2.bed, 3.bed and Results.bed ) there are repeated TFEB , some line of Results.bed file are:

NA|NA|NA
NA|NA|NA
NA|NA|NA
NA|NA|NA
NA|NA|NA
NA|NA|NA
chr6    41651715        41703997        ENSG00000112561.13 ........................TFEB |...
chr6    41651715        41703997        ENSG00000112561.13 ........................TFEB|....
chr6    41651715        41703997        ENSG00000112561.13 ........................TFEB|....

Sorry if im wrong in above protocol

ADD REPLY
0
Entering edit mode

I don't really know what's wrong, but if you want to post your files somewhere where I can download them, I can try running things on this side.

ADD REPLY
0
Entering edit mode

please find this link (this is my annoated file) http://wikisend.com/download/618016/answer.bed

ADD REPLY
0
Entering edit mode

Try:

$ grep TFEB /Users/alexpreynolds/Downloads/answer.bed | closest-features --no-overlaps --closest - gencode.v19.genes.bed

The result I get:

chr6    41651715    41703997    ENSG00000112561.13  .   -   HAVANA  gene    .   gene_id "ENSG00000112561.13"; transcript_id "ENSG00000112561.13"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "TFEB"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "TFEB"; level 2; havana_gene "OTTHUMG00000014684.12";|chr6    41704448    41721847    ENSG00000096088.12  .   -   HAVANA  gene    .   gene_id "ENSG00000096088.12"; transcript_id "ENSG00000096088.12"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "PGC"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "PGC"; level 1; havana_gene "OTTHUMG00000014683.6";
ADD REPLY
0
Entering edit mode

Thank you Alex, I really appreciate your taking the time and help. Last question: so I can say that "PGC" gene is nearest to TFEB.. Am I correct?

Thanks a lot

ADD REPLY
0
Entering edit mode

It is the closest of the two possibilities, yes, ignoring strand.

ADD REPLY

Login before adding your answer.

Traffic: 2488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6