Entering edit mode
5.3 years ago
Adrian Pelin
★
2.6k
Hello, I have per bp coverage I got from bedtools for the entire human genome using:
bedtools genomecov -ibam BAM-FILE.sort.grp.bam -d > BAM-FILE.coverage_per.bp
It looks something like this:
chr12 1 106
chr12 2 103
chr12 3 100
chr12 4 101
chr12 5 99
chr12 6 90
chr12 7 88
chr12 8 89
chr12 9 84
chr12 10 79
chr12 11 72
chr12 12 65
chr12 13 66
chr12 14 64
chr12 15 63
chr12 16 60
chr12 17 53
chr12 18 52
chr12 19 51
chr12 20 49
So for every bp position in the 3Gb genome there is a coverage values. I would like an average values for windows of either 10, 50 or 100bp. So for the above example in windows of 10bp expected output would be:
chr12 1 10 93.9
chr12 11 20 59.5
There must be a really easy way to do this and I am struggling to code a shell script for this.
This is a very neat approach indeed, thanks Alex! Testing out on mouse genome now.
In theory, if you were interested in doing this for the entire genome for multiple samples, you wouldn't need to do the "bedops --merge coverage.bed5 | bedops --chop 10 -" right? You could get the chopped file from a fasta file, if you had the length of each chromosome.
Yep! In fact, there is a handy tool called
fetchChromSizes
in the UCSC Kent utilities kit for doing this, without FASTA. You might look at this answer, replacing hg38 with mm10, or whichever mouse assembly you're using, and modifying the chop parameter: A: Is it possible to generate a BED/BEDGRAPH containing every dinucleotide in the gInstead of chopping up a genome each time you want to do this for each sample, you might chop it once. Then, for each sample, specify it as a file in the
bedmap
statement, replacing the hyphen with the filename of the chopped regions.Indeed, I was using read_fasta from biopieces to get contig sizes. One follow up question. Is there a simpler way to get average coverage from a bam file for 10bp windows? rather than getting a per base coverage with bedtools genomecov, which produces a massive 50+GB file for mouse/human genomes and then processing it your way, which works but seems that it needs a lot of space.
You could use a streaming approach, combining piping with bash process substitutions, e.g.:
The file
answer.bed
will contain the bin and the number of reads which overlap that bin. This is pretty simplistic. You could do more here to do sliding windows or only map a read to one bin, to avoid multiple counts, etc. But the idea is to use a simple example to demonstrate input/output streams as a way to avoid intermediate files.The file
bins.bed
are the 10nt bins generated from choppingfetchChromSizes
or by merging and chopping contigs.These bins can come from an upstream process, to avoid creating intermediate files:
However, if you have lots of bam files, it probably makes more sense (saves more time) to generate
bins.bed
once and reuse it. Or use a compressed equivalent, described below.The BEDOPS kit includes a compression format called Starch that can compress BED files to about 3-4% of their original size:
You can use Starch files in BEDOPS operations directly. For instance, here I am replacing the file
bins.bed
with its compressed equivalentbins.starch
:If you have to have an intermediate BED file, then Starch can be a very useful way to save disk space, and still do the same operations.
Excellent, this is a wonderful speedier alternative, thanks. Two comments. First, starch for some reason omits a contigs name past a dot character ".", ex (hEIF4E2.Homo_sapiens became hEIF4E2). Perhaps this was intended as such, however for custom genome assemblies and contig naming this may be an issue. Second, the approach using bedtools genomecov uses a --mean parameter for bedops while the bam2bed approach uses a --count argument. I tried using --mean with bam2bed but that produced a lot of "NAN" and averages that made no sense. When compare genomecov mean and bam2bed count with the same interval file, there are consistent in where they determine coverage, however not always on the actual values. For instance a genomecov mean can be 1.100000 while a bam2bed count can be 2, a 6.180000 a 14 and so on. Obviously they use different strategies, I am wondering where does the difference lie in.
I think I get it, the second approach, bedops counts amount of reads overlapping by at least 1bp in the intervals specified, so it's not really a mean coverage per bp in the interval but rather amount of reads. So a read can potentially increase the count of two different intervals. Works as well.
Bam2bed only converts reads to BED. It does not do any statistical summarization like coverage.
This is where downstream operations with
bedmap --count
come in, which counts raw reads over intervals.Use of a summary operation like
--mean
makes sense for coverage signal (which has already applied some kind of counting op), but it does not make sense for raw reads.If a read falls over multiple intervals, it will be counted over each interval without some extra logic. I can suggest an approach that assigns a read to only one bin, if you like.