I have 1000 bedgraphs with coverage data for 1000 human samples. From these 1000 bedgraphs, I want to know which regions of the human genome have 30x or more coverage in 20% of the samples (at least 200 samples).
Right now, I map each bedgraph onto the per-base human genome, calculate the depth at each base, and paste the depth columns onto a per-base genome file (each chromosome is processed separately):
paste all.chr4.bed <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AB08136228HG_coverage.bg) <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AC71055668ZJ_coverage.bg) <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed AF63197856OD_coverage.bg) ... <(bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed LASTSAMPLE.bg)
GRCh38.chr4.bed and all.chr4.bed are per-base chromosome files:
chr4 0 1
chr4 1 2
chr4 2 3
chr4 3 4
chr4 4 5
chr4 5 6
The .bg files are sample coverage bedgraphs:
chr1 0 9995 0
chr1 9995 9997 5
chr1 9997 9999 24
chr1 9999 10000 48
chr1 10000 10001 63
chr1 10001 10005 83
At the end all.chr4.bed becomes:
chr4 0 1 0 0 0 0 0
chr4 1 2 0 1 0 1 0
chr4 2 3 0 0 0 0 0
chr4 3 4 0 0 1 0 3
chr4 4 5 0 0 0 1 0
From column 4 onwards, each column is the per-base coverage for a single sample.
Next, from this I filter the regions that have >30x coverage in at least 20% of the samples.
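(For reference, that filtering step is just a per-row count over the depth columns; a rough awk sketch, assuming the depth values start in column 4 and using the "30x or more in at least 200 samples" thresholds from above:)

awk 'BEGIN{OFS="\t"} {n=0; for(i=4;i<=NF;i++) if($i>=30) n++; if(n>=200) print $1, $2, $3}' all.chr4.bed > chr4.pass.bed

Adjacent passing bases can then be collapsed into regions with something like bedops --merge chr4.pass.bed.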
The problem I am running into is that with 1000 bedgraphs, creating all.chr4.bed becomes really slow. Up to 200 samples the time taken was okay. Is there a faster way to directly calculate the depth of coverage from multiple samples at a position? Or any way I can improve this, or alternative approaches?
Any pointers would be appreciated. Thanks.
Hi, yes, I have a cluster. I am currently running this in parallel across chromosomes (22 jobs in total). Instead of parallelizing by chromosome, processing the samples in chunks of 200 is one idea. Would it be much better than my current method, though?
If you have a cluster, I don't see why you couldn't parallelize both by chromosome and by groups of signals, unless I'm missing something.
It might take a little more bookkeeping, but it should help speed things up, especially if you've already noticed that a group of 200 signals completes more than 5x faster than a 1000-signal set.
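Something along these lines, as a rough, untested sketch — the batch size of 200, the sample list file, and all file names are placeholders, and each batch would be submitted as its own cluster job:

# split the list of bedgraph paths into batches of 200 samples (done once)
split -l 200 chr4_sample_list.txt batch_

# per cluster job: build a partial depth matrix for one chromosome and one batch
# (the batch list file is passed in as $1)
batch=$1
tmpdir=$(mktemp -d)
while read -r bg; do
    bedmap --echo-map-id --unmapped-val 0 GRCh38.chr4.bed "$bg" > "$tmpdir/$(basename "$bg").col"
done < "$batch"
paste "$tmpdir"/*.col > "chr4.$(basename "$batch").mat"
rm -r "$tmpdir"

# after all batch jobs finish: glue the partial matrices onto the per-base coordinates
paste all.chr4.bed chr4.batch_*.mat > all.chr4.full.bed

Column order in the final matrix will follow batch/file order rather than your original sample order, but that shouldn't matter for a per-row count.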
Okay, I'll try this out. Thanks!
You might look at the deeptools kit, but I don't know if it will do exactly what you need. It might be useful if you can work with the BAM files directly, instead of the bedGraph files, and if deeptools can produce summary statistics at a per-base level that you can then filter by your criteria.
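If it can, something like multiBamSummary with a 1 bp bin might give you a per-base counts matrix directly — a rough, untested sketch with placeholder file names, and I don't know how practical a 1 bp bin size is across 1000 BAMs:

multiBamSummary bins \
    --bamfiles /path/to/bams/*.bam \
    --binSize 1 \
    --numberOfProcessors 8 \
    --outRawCounts chr4.perbase_counts.tab \
    -o chr4.perbase_counts.npz

--outRawCounts writes a plain-text table of chrom/start/end plus one count column per BAM, which you could then filter with the same per-row counting you are doing now.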
The only other thing I can think of to make things faster is a database, but there would be significant memory and time costs in creating it. That approach would allow fast constant-time lookups, but it only makes sense if you are looking up positions repeatedly, so that the cost of building the database can be amortized. For a one-off analysis, it would not make sense.
I'll check out deeptools as well. The database option is not feasible, as you mentioned.