Question

From per bp coverage to average coverage in 10bp intervals

0

Entering edit mode

5.3 years ago

Adrian Pelin ★ 2.6k

Hello, I have per bp coverage I got from bedtools for the entire human genome using:

bedtools genomecov -ibam BAM-FILE.sort.grp.bam -d > BAM-FILE.coverage_per.bp

It looks something like this:

chr12  1   106
chr12  2   103
chr12  3   100
chr12  4   101
chr12  5   99
chr12  6   90
chr12  7   88
chr12  8   89
chr12  9   84
chr12  10  79
chr12  11  72
chr12  12  65
chr12  13  66
chr12  14  64
chr12  15  63
chr12  16  60
chr12  17  53
chr12  18  52
chr12  19  51
chr12  20  49

So for every bp position in the 3Gb genome there is a coverage values. I would like an average values for windows of either 10, 50 or 100bp. So for the above example in windows of 10bp expected output would be:

chr12   1   10  93.9
chr12   11  20  59.5

There must be a really easy way to do this and I am struggling to code a shell script for this.

genome DNA-seq bam • 2.3k views

ADD COMMENT • link updated 5.3 years ago by Alex Reynolds 36k • written 5.3 years ago by Adrian Pelin ★ 2.6k

score 3 · Accepted Answer · 2019-08-01

3

Entering edit mode

5.3 years ago

Alex Reynolds 36k

The file answer.bed will have the mean signal over 10-base increments over the bounds defined by the coverage region.

$ awk -vFS="\t" -vOFS="\t" '{ print $1, $2, ($2 + 1), ".", $3 }' coverage.txt | sort-bed - > coverage.bed5
$ bedops --merge coverage.bed5 | bedops --chop 10 - | bedmap --echo --mean --delim '\t' - coverage.bed5 > answer.bed

ADD COMMENT • link 5.3 years ago by Alex Reynolds 36k

0

Entering edit mode

This is a very neat approach indeed, thanks Alex! Testing out on mouse genome now.

ADD REPLY • link 5.3 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

In theory, if you were interested in doing this for the entire genome for multiple samples, you wouldn't need to do the "bedops --merge coverage.bed5 | bedops --chop 10 -" right? You could get the chopped file from a fasta file, if you had the length of each chromosome.

ADD REPLY • link 5.3 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Yep! In fact, there is a handy tool called fetchChromSizes in the UCSC Kent utilities kit for doing this, without FASTA. You might look at this answer, replacing hg38 with mm10, or whichever mouse assembly you're using, and modifying the chop parameter: A: Is it possible to generate a BED/BEDGRAPH containing every dinucleotide in the g

Instead of chopping up a genome each time you want to do this for each sample, you might chop it once. Then, for each sample, specify it as a file in the bedmap statement, replacing the hyphen with the filename of the chopped regions.

ADD REPLY • link 5.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Indeed, I was using read_fasta from biopieces to get contig sizes. One follow up question. Is there a simpler way to get average coverage from a bam file for 10bp windows? rather than getting a per base coverage with bedtools genomecov, which produces a massive 50+GB file for mouse/human genomes and then processing it your way, which works but seems that it needs a lot of space.

ADD REPLY • link 5.3 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

You could use a streaming approach, combining piping with bash process substitutions, e.g.:

$ bedmap --echo --count --delim '\t' bins.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain the bin and the number of reads which overlap that bin. This is pretty simplistic. You could do more here to do sliding windows or only map a read to one bin, to avoid multiple counts, etc. But the idea is to use a simple example to demonstrate input/output streams as a way to avoid intermediate files.

The file bins.bed are the 10nt bins generated from chopping fetchChromSizes or by merging and chopping contigs.

These bins can come from an upstream process, to avoid creating intermediate files:

$ bedops --merge contigs.bed | bedops --chop 10 - | bedmap --echo --count --delim '\t' - <(bam2bed < reads.bam) > answer.bed

However, if you have lots of bam files, it probably makes more sense (saves more time) to generate bins.bed once and reuse it. Or use a compressed equivalent, described below.

The BEDOPS kit includes a compression format called Starch that can compress BED files to about 3-4% of their original size:

$ bedops --merge contigs.bed | bedops --chop 10 - | starch --omit-signature - > bins.starch

You can use Starch files in BEDOPS operations directly. For instance, here I am replacing the file bins.bed with its compressed equivalent bins.starch:

$ bedmap --echo --count --delim '\t' bins.starch <(bam2bed < reads.bam) > answer.bed

If you have to have an intermediate BED file, then Starch can be a very useful way to save disk space, and still do the same operations.

ADD REPLY • link 5.3 years ago by Alex Reynolds 36k

0

Entering edit mode

Excellent, this is a wonderful speedier alternative, thanks. Two comments. First, starch for some reason omits a contigs name past a dot character ".", ex (hEIF4E2.Homo_sapiens became hEIF4E2). Perhaps this was intended as such, however for custom genome assemblies and contig naming this may be an issue. Second, the approach using bedtools genomecov uses a --mean parameter for bedops while the bam2bed approach uses a --count argument. I tried using --mean with bam2bed but that produced a lot of "NAN" and averages that made no sense. When compare genomecov mean and bam2bed count with the same interval file, there are consistent in where they determine coverage, however not always on the actual values. For instance a genomecov mean can be 1.100000 while a bam2bed count can be 2, a 6.180000 a 14 and so on. Obviously they use different strategies, I am wondering where does the difference lie in.

ADD REPLY • link 5.3 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

I think I get it, the second approach, bedops counts amount of reads overlapping by at least 1bp in the intervals specified, so it's not really a mean coverage per bp in the interval but rather amount of reads. So a read can potentially increase the count of two different intervals. Works as well.

ADD REPLY • link 5.3 years ago by Adrian Pelin ★ 2.6k

0

Entering edit mode

Bam2bed only converts reads to BED. It does not do any statistical summarization like coverage.

This is where downstream operations with bedmap --count come in, which counts raw reads over intervals.

Use of a summary operation like --mean makes sense for coverage signal (which has already applied some kind of counting op), but it does not make sense for raw reads.

If a read falls over multiple intervals, it will be counted over each interval without some extra logic. I can suggest an approach that assigns a read to only one bin, if you like.

ADD REPLY • link 5.3 years ago by Alex Reynolds 36k