Question

Finding overlapping regions of ChIP-seq peaks, then visualising them


9.4 years ago

morovatunc ▴ 560

Hi,

I am trying to visualise overlapping ChIP-seq peak regions that I identified with Homer's mergePeaks function. It gave me one venn info file and a "result" file. I would like to visualise that venn info file, but none of the visualisation libraries or programs I looked at meets my needs.

The primary problem is that my data is relatively big: I have 19 datasets in one condition group and 9 datasets in the healthy one. I have read in a Biostars thread that making a Venn diagram for more than 3 datasets is not a smart idea.

So far I have tried the DiffBind R library for this, but I couldn't figure out its class structure. VennDiagram and Vennerable also did not quite work for me, because the Homer output does not meet the object requirements of either library.

I am trying to find overlapping regions between transcription factors, which is why I want to know which transcription factor sites are most common.

I code in Python and am mediocre at R.

Please don't point me to the "Venn/Euler Diagram Of Four Or More Sets" and "Draw Diagrams For Intersection Between Many Sets" threads (I have already read them 0192308 times). Also, if you think Homer is not the best tool for finding the overlaps, please feel free to suggest others. (Yes, I do know monkseq.)

Thank you very much for your help. I have been stuck on this step of my project for a week, so I am very frustrated.

Best regards,
Tunc

ChIP-Seq Venn • 5.4k views


Ram · Answer 1 · 2015-10-31


9.4 years ago

C ▴ 20

I'm not completely familiar with Homer's mergePeaks function, so take my advice with a grain of salt. Have you considered perhaps taking a look at the intersect function found in bedtools?

Also, perhaps what you'd like to do is visualize the data as a bar graph? If I understand correctly, you want to overlap the regions of the transcription factors and then see which TFs are most common. Venn diagrams would usually be the way to go for this sort of thing if you had three data sets or fewer, but with more than that, something simple like a bar graph might do the trick.
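A minimal sketch of the bar graph idea in Python, assuming you already have, for each overlapping region, the list of datasets (TFs) that have a peak there. The function names, the `min_datasets` cutoff and the example TF names are made up for illustration; they do not come from Homer or bedtools. It counts how many shared regions each TF takes part in and prints a plain-text bar per TF:

```python
from collections import Counter


def count_shared_regions(region_members, min_datasets=2):
    """Count, per dataset, the regions it shares with other datasets.

    region_members: iterable of collections of dataset names, one per region.
    min_datasets:   hypothetical cutoff; regions seen in fewer datasets are skipped.
    """
    counts = Counter()
    for members in region_members:
        unique = set(members)  # a dataset counts once per region
        if len(unique) >= min_datasets:
            counts.update(unique)
    return counts


def text_bar_chart(counts, width=40):
    """Return one text line per dataset, longest bar first."""
    if not counts:
        return []
    top = max(counts.values())
    name_width = max(len(name) for name in counts)
    # Sort by descending count, then by name so ties are ordered reproducibly.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    lines = []
    for name, n in ordered:
        bar = "#" * max(1, round(width * n / top))
        lines.append(f"{name.ljust(name_width)} | {bar} {n}")
    return lines


if __name__ == "__main__":
    # Toy input: each inner list is one overlapping region.
    regions = [["TF_A", "TF_B"], ["TF_A"], ["TF_A", "TF_C"], ["TF_B", "TF_A", "TF_C"]]
    for line in text_bar_chart(count_shared_regions(regions)):
        print(line)
```

The same counts could be passed to a plotting library instead of the text renderer; the text version is used here only to keep the sketch dependency-free.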


You were right, my Venn data looked pretty complicated. Could you please expand on that bar graph idea?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by morovatunc ▴ 560


He/she might be referring to something like this: Venn/Euler Diagram Of Four Or More Sets

ADD REPLY • link 9.4 years ago by dariober 15k

Ram · Answer 2 · 2015-11-02

Maybe this procedure could help you. It needs bedtools mergeBed and standard unix commands.

First add to each dataset a column as identifier, for example the filename itself:

for peakFile in peakCond1Dataset1.bed peakCond1Dataset2.bed ... peakCondNDatasetN.bed
do
    # Keep chrom, start, end and append the file name as a fourth column
    awk -v OFS="\t" -v id="$peakFile" '{print $1, $2, $3, id}' "$peakFile" > "${peakFile}.tmp.bed"
done

Then concatenate, sort and merge all these files (19+9 files, right?):

cat *.tmp.bed \
| sort -k1,1 -k2,2n \
| mergeBed -c 4,4 -o distinct,count_distinct -i - > merged.bed

With mergeBed as above you get in the last two columns the list of datasets that have been combined in each merged peak and the number of distinct datasets contributing to each merged peak. From this it should be fairly easy to summarize peak overlaps within and between conditions, especially if the data identifiers from step 1 have information about condition and replicate.
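As a rough illustration of that summary step, here is a Python sketch that reads lines in the merged.bed layout described above (chrom, start, end, comma-separated dataset list, number of distinct datasets). It assumes the identifiers contain a `Cond<number>` token, as in the placeholder file names above; `condition_of` and `summarize` are hypothetical helper names, and you would adapt the regular expression to your real file names.

```python
import re
from collections import Counter


def parse_merged_line(line):
    """Split one merged.bed line into typed fields."""
    chrom, start, end, datasets, n = line.rstrip("\n").split("\t")[:5]
    return chrom, int(start), int(end), datasets.split(","), int(n)


def condition_of(dataset_id):
    """ASSUMPTION: the condition is encoded as 'Cond<number>' in the identifier."""
    match = re.search(r"Cond\d+", dataset_id)
    return match.group(0) if match else "unknown"


def summarize(lines, condition_of=condition_of):
    """Return (regions per number of datasets, regions per set of conditions)."""
    per_count = Counter()
    per_conditions = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, _, _, datasets, n = parse_merged_line(line)
        per_count[n] += 1
        conditions = tuple(sorted({condition_of(d) for d in datasets}))
        per_conditions[conditions] += 1
    return per_count, per_conditions


if __name__ == "__main__":
    with open("merged.bed") as handle:
        per_count, per_conditions = summarize(handle)
    for n in sorted(per_count):
        print(f"{n} dataset(s): {per_count[n]} merged peaks")
    for conditions, total in sorted(per_conditions.items()):
        print(f"{'+'.join(conditions)}: {total} merged peaks")
```

Regions whose condition tuple has a single entry are specific to that condition; tuples with several entries are regions shared between conditions.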

EDIT

After having identified "consensus" peak regions, you might ask how enriched these regions are in each individual library. The original peak files are no longer useful here: the consensus regions have boundaries different from the original peaks (peaks have been merged), and some peak files may not contain a given region at all if it didn't pass the stringency of the peak caller. So, for each library (BAM file), I count the reads falling in each consensus region and quantify enrichment relative to the surrounding background, i.e. simply "# reads in region / # reads in surrounding", appropriately scaled by size. I wrote a simple script for this, if you are interested: localEnrichmentBed.py
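To make the ratio concrete, here is a small Python sketch of "reads in region / reads in surrounding, scaled by size". It is not the localEnrichmentBed.py script mentioned above; the function name, the argument names and the pseudocount (added so that an empty background does not cause a division by zero) are assumptions for illustration. The read counts themselves would come from your BAM files.

```python
def local_enrichment(reads_in_region, region_length,
                     reads_in_flanks, flank_length, pseudocount=1.0):
    """Ratio of read density in a region to read density in its surroundings.

    Lengths are in base pairs. The pseudocount is an assumption of this
    sketch, not something specified in the answer above.
    """
    if region_length <= 0 or flank_length <= 0:
        raise ValueError("region_length and flank_length must be positive")
    region_density = (reads_in_region + pseudocount) / region_length
    flank_density = (reads_in_flanks + pseudocount) / flank_length
    return region_density / flank_density


if __name__ == "__main__":
    # 50 reads in a 200 bp consensus region, 10 reads in 1000 bp of flanking sequence.
    print(local_enrichment(50, 200, 10, 1000))
```

Values well above 1 mean the region has a higher read density than its surroundings in that library; values near 1 mean the library shows no local enrichment there.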