Question

How to best summarize read counts overlapping promoters when read counts are defined from different regions

5.0 years ago

jamespower ▴ 100

Hi,

I have read counts from a dataset (dataset1) that are defined over regions of 100 bases, and I am trying to find the best way to get a single read-count value for each promoter region (of 1000 bases). In particular, how should I treat read counts in overlapping regions?

I know there is a lot of literature on how to deal with overlapping features (featureCounts, htseq-count), but I do not want to remove overlapping promoters as most recommend, nor do I want to choose one or the other. I would like a summary that includes all overlaps. Also note that I only have counts, not the original read positions.

I will also do the same for another dataset (dataset2). My final goal is to use DESeq2 across promoters for dataset1 and dataset2. Do you have suggestions for the best way to do this?

Thanks!

RNA-Seq gene next-gen assembly ChIP-Seq • 1.7k views

updated 5.0 years ago by GouthamAtla 12k • written 5.0 years ago by jamespower ▴ 100

It is unclear what you really want to do. Please provide illustrative example data with input and expected output.

5.0 years ago by ATpoint 88k

Hi, my question is not how to do it but what the best approach is for overlapping read counts, which depends on assumptions about the distribution of read counts: I don't know whether to sum the counts, take the mean (average), or do something else. Maybe this is not the best website to ask this?

5.0 years ago by jamespower ▴ 100

How would you remap if you don't have the original positions (BAM?) and just the count matrix?

5.0 years ago by piyushjo ▴ 710

Sorry, I cannot remap. Let me correct that, thank you.

5.0 years ago by jamespower ▴ 100

score 1 · Accepted Answer · 2020-05-04

If two promoters overlap, there is no way to resolve which promoter each data point came from unless you have a stranded dataset with single-base-pair resolution.

One option is to assign the bins (regions of 100 bases) to both of the overlapping promoters. Then get an average read count for each promoter by averaging the counts from the bins that overlap that promoter.
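A minimal sketch of this first option, assuming half-open BED-style coordinates and made-up example data. The function name `mean_bin_counts_per_promoter` and the tuple layout are hypothetical, chosen only for illustration:

```python
from collections import defaultdict

def mean_bin_counts_per_promoter(bins, promoters):
    """Average the counts of all bins overlapping each promoter.

    bins:      list of (chrom, start, end, count), half-open coordinates
    promoters: list of (promoter_id, chrom, start, end), half-open coordinates
    A bin shared by two overlapping promoters contributes to both.
    """
    bins_by_chrom = defaultdict(list)
    for chrom, start, end, count in bins:
        bins_by_chrom[chrom].append((start, end, count))

    result = {}
    for pid, chrom, pstart, pend in promoters:
        # Two half-open intervals overlap when each starts before the other ends.
        hits = [c for s, e, c in bins_by_chrom[chrom] if s < pend and e > pstart]
        result[pid] = sum(hits) / len(hits) if hits else 0.0
    return result

# Hypothetical example: four 100-base bins and two overlapping promoters.
bins = [("chr1", 0, 100, 4), ("chr1", 100, 200, 10),
        ("chr1", 200, 300, 6), ("chr1", 300, 400, 2)]
promoters = [("P1", "chr1", 0, 250), ("P2", "chr1", 150, 400)]
promoter_means = mean_bin_counts_per_promoter(bins, promoters)
```

Keep in mind that DESeq2 expects non-negative integer counts, so averaged values would need rounding, or you could sum instead of averaging.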

Another option is to merge the promoter coordinates (mergeBed) and assign a new ID.
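To illustrate what merging does, here is a rough pure-Python sketch of the idea. It is not mergeBed itself; the function name `merge_promoters` and the choice to build the new ID by joining the old IDs with `|` are assumptions for this example:

```python
def merge_promoters(promoters):
    """Merge overlapping promoters and give each merged region a new ID.

    promoters: list of (promoter_id, chrom, start, end), half-open coordinates.
    Returns a list of (new_id, chrom, start, end), where new_id joins the
    original IDs with "|" (a hypothetical naming scheme).
    """
    merged = []
    for pid, chrom, start, end in sorted(promoters, key=lambda p: (p[1], p[2], p[3])):
        # Extend the previous region if this one overlaps or touches it.
        if merged and merged[-1][1] == chrom and start <= merged[-1][3]:
            prev_ids, _, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_ids + [pid], chrom, prev_start, max(prev_end, end))
        else:
            merged.append(([pid], chrom, start, end))
    return [("|".join(ids), chrom, s, e) for ids, chrom, s, e in merged]

# Hypothetical example: P1 and P2 overlap, P3 stands alone.
promoters = [("P1", "chr1", 0, 250), ("P2", "chr1", 150, 400),
             ("P3", "chr1", 1000, 2000)]
merged_promoters = merge_promoters(promoters)
```

With this approach the two overlapping promoters are no longer distinguishable: they share one row in the count matrix.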

Another option is to split the region of overlapping promoters into unique and common regions, i.e., you will create three regions from two overlapping promoters. For example, using bedops --partition, as depicted below.

[Image: bedops partition of two overlapping intervals]
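The partitioning idea can be sketched in plain Python as follows. This is only an illustration of the concept for a single chromosome, not a reimplementation of bedops; the function name `partition_promoters` and the example coordinates are hypothetical:

```python
def partition_promoters(promoters):
    """Split promoters on ONE chromosome into disjoint segments.

    promoters: list of (promoter_id, start, end), half-open coordinates.
    Returns a list of (start, end, [promoter_ids]) where each segment is
    covered by exactly the listed promoters.
    """
    # Every promoter start or end is a potential segment boundary.
    breakpoints = sorted({pos for _, s, e in promoters for pos in (s, e)})
    segments = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        ids = [pid for pid, s, e in promoters if s <= left and e >= right]
        if ids:  # skip gaps that no promoter covers
            segments.append((left, right, ids))
    return segments

# Hypothetical example: two overlapping promoters give three segments.
promoters = [("P1", 0, 250), ("P2", 150, 400)]
segments = partition_promoters(promoters)
# segments: P1 only, P1 and P2 shared, P2 only
```

Note that with 100-base bins, segment boundaries will generally not line up with bin boundaries, so a bin can still straddle two segments.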

It is very uncommon for promoters to overlap, as a promoter is loosely defined as +/- 1-2 kb upstream/downstream of a TSS, or based on epigenomic marks, which usually span up to 2 kb.