I have 1000 bedgraphs with coverage data for 1000 human samples. From these 1000 bedgraphs, I want to know which regions of the human genome have 30x or more coverage in 20% of the samples (at least 200 samples).
Right now, I map each bedgraph onto the per-base human genome, calculate the depth at each base, and paste the depth columns onto a per-base genome file (each chromosome is processed separately):
paste all.chr4.bed <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AB08136228HG_coverage.bg) <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AC71055668ZJ_coverage.bg) <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AF63197856OD_coverage.bg) ... <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed LASTSAMPLE.bg)
GRCh38.chr4.bed and all.chr4.bed are per-base chromosome files:
chr4 0 1
chr4 1 2
chr4 2 3
chr4 3 4
chr4 4 5
chr4 5 6
The .bg files are sample coverage bedgraphs:
chr1 0 9995 0
chr1 9995 9997 5
chr1 9997 9999 24
chr1 9999 10000 48
chr1 10000 10001 63
chr1 10001 10005 83
At the end all.chr4.bed becomes:
chr4 0 1 0 0 0 0 0
chr4 1 2 0 1 0 1 0
chr4 2 3 0 0 0 0 0
chr4 3 4 0 0 1 0 3
chr4 4 5 0 0 0 1 0
From column 4 onwards, each column is the per-base coverage for a single sample.
Next, from this I filter the regions that have >30x coverage in at least 20% of the samples.
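(For reference, that filtering step is just a per-row count over the depth columns; a rough awk sketch, assuming the depth values start in column 4 and using the "30x or more in at least 200 samples" thresholds from above:)

awk 'BEGIN{OFS="\t"} {n=0; for(i=4;i<=NF;i++) if($i>=30) n++; if(n>=200) print $1, $2, $3}' all.chr4.bed > chr4.pass.bed

Adjacent passing bases can then be collapsed into regions with something like bedops --merge chr4.pass.bed.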
The problem I am running into is that with 1000 bedgraphs, creating all.chr4.bed becomes really slow. Up to 200 samples the time taken was okay. Is there a faster way to directly calculate the depth of coverage from multiple samples at a position? Or any way I can improve this, or alternative approaches?
Any pointers would be appreciated. Thanks.
Hi, yes, I have a cluster. I am currently running this in parallel across chromosomes (22 jobs in total). Instead of parallelizing by chromosome, processing the samples in chunks of 200 is one idea. Would it be much better than my current method, though?
If you have a cluster, I don't see why you couldn't parallelize both by chromosome and by groups of signals, unless I'm missing something.
It might take a little more bookkeeping, but it should help speed things up, especially if you've already noticed that a group of 200 signals completes more than 5x faster than a 1000-signal set.
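Something along these lines, as a rough, untested sketch — the batch size of 200, the sample list file, and all file names are placeholders, and each batch would be submitted as its own cluster job:

# split the list of bedgraph paths into batches of 200 samples (done once)
split -l 200 chr4_sample_list.txt batch_

# per cluster job: build a partial depth matrix for one chromosome and one batch
# (the batch list file is passed in as $1)
batch=$1
tmpdir=$(mktemp -d)
while read -r bg; do
    bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed "$bg" > "$tmpdir/$(basename "$bg").col"
done < "$batch"
paste "$tmpdir"/*.col > "chr4.$(basename "$batch").mat"
rm -r "$tmpdir"

# after all batch jobs finish: glue the partial matrices onto the per-base coordinates
paste all.chr4.bed chr4.batch_*.mat > all.chr4.full.bed

Column order in the final matrix will follow batch/file order rather than your original sample order, but that shouldn't matter for a per-row count.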
Okay, I'll try this out. Thanks!
You might look at the deeptools kit, but I don't know if it will do exactly what you need. It might be useful if you can work with the BAM files directly, instead of the bedGraph files, and if deeptools can produce summary statistics at a per-base level that you can then filter by your criteria.
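If it can, something like multiBamSummary with a 1 bp bin might give you a per-base counts matrix directly — a rough, untested sketch with placeholder file names, and I don't know how practical a 1 bp bin size is across 1000 BAMs:

multiBamSummary bins \
    --bamfiles /path/to/bams/*.bam \
    --binSize 1 \
    --numberOfProcessors 8 \
    --outRawCounts chr4.perbase_counts.tab \
    -o chr4.perbase_counts.npz

--outRawCounts writes a plain-text table of chrom/start/end plus one count column per BAM, which you could then filter with the same per-row counting you are doing now.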
The only other thing I can think of to make things faster is a database, but there would be significant memory and time costs in creating it. That approach would allow fast constant-time lookups, but it only makes sense if you are looking up positions repeatedly, so that the cost of building the database can be amortized. For a one-off analysis, it would not make sense.
I'll check out deeptools as well. The database option is not feasible, as you mentioned.