How to plot coverage and depth statistics of a bam file
8
30
Entering edit mode
10.4 years ago
kay ▴ 380

Hello,

I am able to find tools like Bedtools, Samtools, etc that calculate the coverage and depth statistics for a BAM file.

But I am looking for a tool that will plot them for my BAM file. I tried a tool called BAMStats (http://bamstats.sourceforge.net/). This tool did not work on my BAM file - it errored out saying "too many reference files". My BAM file is from viral samples so there are many reference genomes.

I am able to run Bedtools and Samtools mpileup and get some numbers spit out, but I don't know how to create a plot out of it.

Any help on this will be useful.

Thanks

Krithika

bam coverage depth next-gen-sequencing • 109k views
ADD COMMENT
0
Entering edit mode

Hi Deedee - thanks for the response. Can you many give some explanation on which tool in the Picard suite of tools can be used ?

ADD REPLY
0
Entering edit mode

Definitely! Answer updated.

ADD REPLY
0
Entering edit mode

I did try out CollectAlignmentSummaryMetrics from Picard - I'm not exactly sure that is what I am looking for. I have also tried BedTools' genomeCov.

Basically what I am looking for is:

  • Looking for a tool that will spit out some statistics about coverage and depth of BAM file. I want to use this information to plot coverage and depth
  • OR find a tool that will directly spit out these plots. I found two tools that do this - BamStats and Qualimap. Both are not working on my data, probably because they are not human genome, but viral genomes.
ADD REPLY
0
Entering edit mode

Hi!

You can use the python program bam2plot, which I have developed. Install via pip install bam2plot, and run:

bam2plot my_bam.bam

It produces one png and one svg for each reference (e.g. chromosome) in the bam file.

Check it out here: https://github.com/willros/bam2plot

Regards, William

ADD REPLY
0
Entering edit mode

Hi William,

I was trying to use bam2plot you have developed. It was installed on a python evnviroment python version 3.11.

The installation went well without a problem. Here are the error messages I have received. the bam file is indexed and sorted. error message Please advise, and many thanks. Jinli

ADD REPLY
0
Entering edit mode

Hi Jinli,

It seems to be related to the pandas version you are using. Which version are you using? Try to install the latest version and see if it works then. Let me know otherwise!

ADD REPLY
48
Entering edit mode
8.3 years ago
kirannbishwa01 ★ 1.6k

This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. While this question is 2 years old but I am hoping there are many individuals trying to work out the solution to this problem, so I hope this will be helpful for others.

Identification of the depth of coverage is quite useful in 1) identifying the regions that might have potential paralogous alignment 2) finding the coverage at your regions of interest.

Here is a step by step guide to the problem/question.

Step #1) First identify the depth at each locus from a bam file.

I have found samtools depth option more useful in this regard, when coverage at each locus is desired.

samtools depth deduped_MA605.bam > deduped_MA605.coverage

The output file 'deduped_MA605.coverage' file will have 3 columns (Chr#, position and depth at that position) like below.

Chr  position depth (this header will be absent though)
1       3980        66
1       3981        68
1      3982     67
1      3983        67
1      3984     68

Step #2) Now, select the coverage (depth) by locus for each chromosome and/or regions

We can use the coverage file to plot it in R. But, the file is so large that it will suck up almost all the memory. It better to split the coverage by chromosome (or region of the chromosome if required).

To select the coverage for a particular chromosome (Chr#1 in my case)

awk '$1 == 1 {print $0}' deduped_MA605.coverage > chr1_MA605.coverage

To select coverage from chr #2

awk '$1 == 2 {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

If the chrosomosome has string characters it can be adjusted as

awk '$1 == "chr2" {print $0}' deduped_MA605.coverage > chr2_MA605.coverage

Step #3) To plot the data in R this coverage file will need to be imported and the headers need to be added.

MA605.chr2 <- read.table("/media/everestial007/SeagateBackup4.0TB/New_Alignment_Set/05-B-deDupedReads-forScanIndel/chr2_MA605.coverage",
         header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)`

Note: The header of the column are automatically set as V1, V2 and V3.

To rename the headers

library(reshape) # loads the library to rename the column names

MA605.chr2<-rename(MA605.chr2,c(V1="Chr", V2="locus", V3="depth")) # renames the header

Now, plot the coverage by depth:

plot(MA605.chr2$locus, MA605.chr2$depth)

to get the wider/cleaner view of the plot use

library(lattice, pos=10) xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16), scales=list(x=list(relation='same'), y=list(relation='same')), data=MA605.chr2, main="depth by locus - Chr2 (SampleMA605)")

The output plot looks like this: coverage depth by locus

Note:

  • We can see that there is a region around the middle and next to it that has a very high coverage. This likely suggests paralogous alignment.

  • Also, there is a gap in alignment. The region with no alignment is near (at) centromere where no consensus sequence has been found.

  • Reads may be later filtered by coverage.

Thanks,

ADD COMMENT
1
Entering edit mode

Step #2) can be done in one step like this

gawk '{/^[0-9]/{print >$1".coverag"}' ./deduped_MA605.coverag

and there is a package called Sushi can do the same it takes as input bedgraph

ADD REPLY
0
Entering edit mode

This is really useful. However on the final step I get an error message from R:

Unexpected symbol in "library(lattice, pos=10) xyplot"

I can't see an error here but obviously there is one. If anyone has a good eye and has any suggestions to correct this that would be great...

ADD REPLY
0
Entering edit mode

Try it now. The full code was hidden due to formatting. It should work now.

ADD REPLY
0
Entering edit mode

Would you guide how to plot same chromosome in one graph from multiple samples (bam) files?

Thanks.

ADD REPLY
0
Entering edit mode

I also need a manual for multiple samples. After doing samtools depth and selecting the right chromosome i get a file with the chr column, one column for the regions and for each individuum/sample a depth value.

I really hope someone can help me Thanks;)

ADD REPLY
0
Entering edit mode

I tried this and have the same error..

Error: unexpected symbol in "library(lattice, pos=10) xyplot"

can one advise?

ADD REPLY
0
Entering edit mode

Add a new line after library(lattice, pos=10)

ADD REPLY
10
Entering edit mode
6.3 years ago
Fred0730 ▴ 100

Step #1. Calculate the depth along the genomic locus by Samtools.

samtools depth reads.sort.bam > reads.sort.coverage

Step #2. Plot the coverage by ggplot function in R. (The demo data only includes one chromosome)

setwd("D://R")  # working path
coverage=read.table("reads.sort.coverage", sep="\t", header=F)
install.packages('reshape')
library(reshape)
coverage=rename(coverage,c(V1="Chr", V2="locus", V3="depth")) # renames the header
ggplot(coverage, aes(x=locus, y=depth)) +
geom_point(colour="red", size=1, shape=20, alpha=1/3) +
scale_y_continuous(trans = scales::log10_trans(), breaks = scales::trans_breaks("log10", function(x) 10^x))
ADD COMMENT
2
Entering edit mode

what about if the reads.sort.coverage is huge? (in my case 50GB)

ADD REPLY
9
Entering edit mode
10.4 years ago
Dan D 7.4k

The Picard suite is your ticket.

EDIT:

Specifically, CollectAlignmentSummaryMetrics gives you your alignment percentage (and many other metrics) for a BAM file input.

If you have a interval table (something like a BED file) which describes genomic regions of interest, you can calculate depth of coverage statistics using CalculateHSSummaryMetrics. This tool will also give you detailed alignment metrics as well, so you can run it in place of CollectAlignmentSummaryMetrics.

More generally, there's also BedTools' genomeCov and GATK's DepthOfCoverage

ADD COMMENT
3
Entering edit mode

This doesn't answer the question of plotting the coverage across genome.

ADD REPLY
1
Entering edit mode

The link does not work either.

ADD REPLY
0
Entering edit mode

Picard has moved to Broad, so here's an updated link for those reading today: http://broadinstitute.github.io/picard/command-line-overview.html#CollectAlignmentSummaryMetrics

ADD REPLY
3
Entering edit mode
10.4 years ago
rbagnall ★ 1.8k

Take a look at this blog post from Stephen Turner., which has a step by step guide to visualize coverage for targeted NGS (exome) experiments

ADD COMMENT
3
Entering edit mode
ADD COMMENT
0
Entering edit mode

does it work if you are not interested into knowing the coverage per regions of interest but for all the data?

ADD REPLY
0
Entering edit mode

Hi, this is a cool tool, I was just playing around and comparing it to using R to plot the output of samtools depth as described above. I get somewhat different looking plots from wcscoverageplotter vs. plotting the output of samtools depth -- I think you are doing some kind of smoothing, could you describe exactly your tool works on the default settings? Your tool also reports a somewhat lower typical coverage, in my data roughly about 60x vs 70x, any idea why that would be? Thanks!

ADD REPLY
0
Entering edit mode

default is too bin the coverage under a pixel by the median depth.

ADD REPLY
0
Entering edit mode

Thanks Pierre, so each point on the line is the median depth of all reads contained in that pixel? So these are non-overlapping bins, yes?

Also, one other thing confused me: samtools depth does not ask for the reference genome, however wcscoverageplotter requires it. How is this information used?

ADD REPLY
1
Entering edit mode

it's used to decode CRAM files.

ADD REPLY
1
Entering edit mode

Hey Pierre,

Can you please explain what the green and red lines mean? Maybe it is 'obvious', but I can't seem to find an explanation.

Thanks, Aaron

ADD REPLY
1
Entering edit mode

mean and median depth

ADD REPLY
0
Entering edit mode

green line - mean, red line - median depth

Right?

ADD REPLY
0
Entering edit mode

Thanks, Pierre!

ADD REPLY
0
Entering edit mode
10.4 years ago
Vivek ★ 2.7k

I doubt you'll find an out of the box plotting solution for this but if all you need is coverage information, you could try using samtools mpileup to generate read coverage at each position in the BAM and make a plot using R.

ADD COMMENT
0
Entering edit mode

Hi Vivek,

Would you elaborate more on this, if possible by showing the commands?

Thanks!

ADD REPLY
0
Entering edit mode
samtools mpileup your.bam | awk '{print $1"\t"$2"\t"$4}'

Gives you coverage at each chr:pos

There after you can use something like ggplot in R to plot a histogram of coverage.

ADD REPLY
0
Entering edit mode

Thanks Vivek. I have gone through various posts but still unclear whether to use samtools depth or samtools mpileup to calculate coverage of my multiple bam files. Secondly, as each file is very big, I am thinking to plot different chromosomes from multiple bam files in multiple plots, for e.g. if it consists of 23 chromosomes, I can plot 23 graphs where each graph contains same chromosomes coverage from multiple bam files. I hope I am in the right direction. But, if I want to see the coverage of all chromosomes in all bam files, what would you suggest?

ADD REPLY
2
Entering edit mode

Hi @bioinfo8, just gettting back to this problem a little late. I think you can totally use the code I shared in my answer - using Samtools depth. I think I understand coverage of multiple bam files - i think you have 23 samples (rite??) and you want to plot the coverage for same chromosome from different samples in one plot. I think this might be possible by merging the dataframe from different samples by chr and pos, then create the plot and merge the plot in someway (merge the PDF of jpg/png file), if you create the same pixel by inch for each different plot then merging might work nicely - I haven't tried this but based on my experience this should totally work.

ADD REPLY
0
Entering edit mode

I'm not sure what you mean by all BAM files, in general its best to view coverage over a single sample (BAM) rather than combine across samples as you'd otherwise miss the variation between samples. You should be able to plot all chromosomes at once even if its a WGS experiment but if it is too cumbersome to parse and you have limited computing power, you can probably sub-sample by taking the coverage every 10 bases and likely still get a good enough idea.

ADD REPLY
0
Entering edit mode

Ok. I have exome bam files of different individuals from same organism.

ADD REPLY
0
Entering edit mode

You have a very clear answer from Vivek. Now you want do for all bans write a script then. It will be better.

ADD REPLY
0
Entering edit mode
7.4 years ago
bioinfo8 ▴ 230

I have used properly-paired qc-passed reads for chromosome wise depth calculation and plotting. However, still I can see 0 value for depth at some places in plot, even though I have used samtools depth and not samtools depth -a. Why is it so? Should I remove those 0 values from my data to proceed further or keep them. I am not clear on that.

Also samtools mpileup and samtools depth gives same output.

Please guide.

Thanks!

ADD COMMENT
3
Entering edit mode

I read all the answers/comments on this one. But, I have different opinion regarding this problem. No matter how much deep sequencing you do (even in whole genome) there is a good probability that there will be good amount of genomic regions with 0 depth. The reason is with short reads it is difficult to capture all the reads of the genome - although our assumption is that we are sampling reads from every region of the genome.

  • Another problem is with the repeatitive regions, centromere, transposons, gene duplication etc. - reads from which will not align perfectly creating 0-depth in several genomic regions.
  • In the plot I have for my answer above you can see a big gap in the centromeric region (even though that sample was about whole genome and high coverage sequencing).
  • For the rest of the plot you don't see any gap, but the fact is that there are lots or region with 0-depth but the plot is so compact (such a large chromosome length is represented on such a small x-axis line) that 0-depth region is hard to see. If you could expand the plot by 1000x times you should start seeing the gaps.
  • With exome seq this is going to be more obvious situation.
  • If you are making a coverage plot you have to keep all the position (1-end of chromosome) with every coverage unless you are plotting the coverage on interval basis which is possible using samtools.

Unless I have missed something about the question or about some idea about the coverage, I don't really know what else could be causing the problem you are having. The fact is it's not a problem, that's how the coverage really looks like (0-max depth). But, let me know if you have any questions, tips.

Thanks,

ADD REPLY
0
Entering edit mode

If this is exome sequencing data, you should see zero depth in off target regions after QC to remove mis-mapped reads. Did you try restricting your query to a BED file of target regions?

ADD REPLY
0
Entering edit mode

Yes, it is exome data. I just calculate depth without any bed file by writing a script for aligned bam files to calculate depth:

samtools depth ${describer}.ppqc_sorted.bam > ${describer}_ppqc_sorted.depth

Then splitted each depth file chromosome wise (2nd script):

awk '{print $0 >> $1"_${describer}_ppqc_sorted.depth"}' ${describer}_ppqc_sorted.depth
ADD REPLY
1
Entering edit mode

So did you check if the zero coverage positions lie within your targeted regions? If they don't, you can safely exclude them, if they do lie in your target regions and there is a large chunk of them you can ask your sequencing provider about the poor quality.

ADD REPLY
1
Entering edit mode

Vivek is right . You need to understand if the 0 depth is in regions that are off target or on target . Ideally if it's off target then it should be fine. However depending on the kit of exome target beads and version you are using for enrichment you might have 0 depths also on on target regions as they usually expand at bases more than the actual exonic lengths. You need to take that into account and see if there is a larger proportion of 0 depths in those region or not. If not then removing all off targets you will be able to do your plots

ADD REPLY
0
Entering edit mode

Thanks Vivek and vchris_ngs. I have only this information that these are whole exomes from multiple individuals (bam files). After analyzing them I found out that they are paired-end bam files already aligned with reference and hence I am trying to figure out ways to study them.

So in case of whole exomes, I assume the targets would be protein coding regions only.

ADD REPLY
1
Entering edit mode

You cannot assume that, some exome capture designs also cover UTR regions. You need to ideally get a specific bed file of targeted regions or alteast the name of the capture kit used so you can go to their webpage and download target regions from there.

ADD REPLY
1
Entering edit mode

so if you are using the Agilent kit for the enrichment for the target beads then probably the industry standard is v5/v6 as far as I can recall. So ask the wet lab person who did the library and the enrichment process they can tell you about the company and the version and then look for the bed files related to that specific version of the kit. They are actually easy to retrieve for that specific product. This bed file will then help you in assessing the off target and on target estimation and adding up to it help in making an assessment of 0 depths in those regions as well. Voila, you will have your answers and then make all the pretty plots you want to do.

ADD REPLY
0
Entering edit mode

Ok, I will check it. Thanks :)

ADD REPLY
0
Entering edit mode

Hi Vivek and vchris_ngs,

I figured out that Agilent kit has been used and got the following four files (ref = reference used to capture exons):

1) ref.core.baits_merged.bed 2) ref.core.targets_merged.bed 3) ref_plus_utr.baits.bed 4) ref_plus_utr.targets.bed

The bed file which I created from gtf file of cds (assuming cds in bam files) seems differ from above. I am confused regarding the above files as well as which bed file I should use for further analysis.

Please guide.

Thanks!

ADD REPLY
2
Entering edit mode

I'd assume ref.core.targets_merged.bed is a subset of ref_plus_utr.targets.bed. If so that is the file you should use, if not merge the files together using BEDtools merge and use the resulting bed file. You can ignore the baits bed files.

ADD REPLY
0
Entering edit mode

Thanks a lot Vivek.

1) As far as I understood: ref.core.targets_merged.bed = exons only while ref_plus_utr.targets.bed = exons + UTR

2) Some values of 2nd and 3rd columns from both files are same, some are different and sometimes 3rd column is same but 2nd is different. Also, ref.core.targets_merged.bed has more rows.

Would you please explain the logic of having two bed files and merging them?

As my aim is exome analysis, should I use utr containing bed file also?

Thanks again.

ADD REPLY
1
Entering edit mode

Anything I can suggest here is a guess and you should be getting more information from the people who sequenced the samples as to how they designed the experiment and which regions they planned to target as part of the exome capture.

My intuition with merging the bed files is that it consolidates the coordinates between Exons + UTR regions. Since the UTR file was given to you, I'd think they also planned to target the UTR region.

Why merge them if one file is UTR + Exons and the other is just Exons?

Because the small region from TSS to first exon would likely be common between both files for many genes so you might as well combine them and get a single consolidate file of coordinates.

ADD REPLY
0
Entering edit mode

Thank you so much Vivek.

ADD REPLY
0
Entering edit mode
2.3 years ago
SergFly ▴ 50

Using this commands, you can calculate breadth of coverage and depth of coverage. But only for one genome in the file bam. Can anyone suggest how to calculate for multiple genomes in one bam file? So that for each genome there is output of the width and depth of coverage. One by one it takes a long time to calculate. For example, outputting results for multiple genomes:

breadth of coverage  depth of coverage
genome1 70   200
genome2 80   300
...
genomen 15   100

breadth of coverage

./samtools depth -a ./My_data/aln_sort.bam | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}'

average depth

./samtools depth -a /home/sergey/samtools-1.9/My_data/aln_sort.bam | awk '{c++;s+=$3}END{print s/c}'
ADD COMMENT

Login before adding your answer.

Traffic: 1977 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6