This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. While this question is 2 years old but I am hoping there are many individuals trying to work out the solution to this problem, so I hope this will be helpful for others.
Identification of the depth of coverage is quite useful in 1) identifying the regions that might have potential paralogous alignment 2) finding the coverage at your regions of interest.
Here is a step by step guide to the problem/question.
Step #1) First identify the depth at each locus from a bam file.
I have found samtools depth option more useful in this regard, when coverage at each locus is desired.
samtools depth deduped_MA605.bam > deduped_MA605.coverage
The output file 'deduped_MA605.coverage' file will have 3 columns (Chr#, position and depth at that position) like below.
Chr position depth (this header will be absent though)
1 3980 66
1 3981 68
1 3982 67
1 3983 67
1 3984 68
Step #2) Now, select the coverage (depth) by locus for each chromosome and/or regions
We can use the coverage file to plot it in R. But, the file is so large that it will suck up almost all the memory.
It better to split the coverage by chromosome (or region of the chromosome if required).
To select the coverage for a particular chromosome (Chr#1 in my case)
awk '$1 == 1 {print $0}' deduped_MA605.coverage > chr1_MA605.coverage
To select coverage from chr #2
awk '$1 == 2 {print $0}' deduped_MA605.coverage > chr2_MA605.coverage
If the chrosomosome has string characters it can be adjusted as
awk '$1 == "chr2" {print $0}' deduped_MA605.coverage > chr2_MA605.coverage
Step #3) To plot the data in R this coverage file will need to be imported and the headers need to be added.
MA605.chr2 <- read.table("/media/everestial007/SeagateBackup4.0TB/New_Alignment_Set/05-B-deDupedReads-forScanIndel/chr2_MA605.coverage",
header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)`
Note: The header of the column are automatically set as V1, V2 and V3.
To rename the headers
library(reshape)
# loads the library to rename the column names
MA605.chr2<-rename(MA605.chr2,c(V1="Chr", V2="locus", V3="depth"))
# renames the header
Now, plot the coverage by depth:
plot(MA605.chr2$locus, MA605.chr2$depth)
to get the wider/cleaner view of the plot use
library(lattice, pos=10) xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16), scales=list(x=list(relation='same'), y=list(relation='same')), data=MA605.chr2, main="depth by locus - Chr2 (SampleMA605)")
The output plot looks like this:
Note:
We can see that there is a region around the middle and next to it that has a very high coverage. This likely suggests paralogous alignment.
Also, there is a gap in alignment. The region with no alignment is near (at) centromere where no consensus sequence has been found.
Reads may be later filtered by coverage.
Thanks,
Hi Deedee - thanks for the response. Can you many give some explanation on which tool in the Picard suite of tools can be used ?
Definitely! Answer updated.
I did try out CollectAlignmentSummaryMetrics from Picard - I'm not exactly sure that is what I am looking for. I have also tried BedTools' genomeCov.
Basically what I am looking for is:
Hi!
You can use the python program
bam2plot
, which I have developed. Install viapip install bam2plot
, and run:It produces one
png
and onesvg
for each reference (e.g. chromosome) in the bam file.Check it out here: https://github.com/willros/bam2plot
Regards, William
Hi William,
I was trying to use bam2plot you have developed. It was installed on a python evnviroment python version 3.11.
The installation went well without a problem. Here are the error messages I have received. the bam file is indexed and sorted. Please advise, and many thanks. Jinli
Hi Jinli,
It seems to be related to the pandas version you are using. Which version are you using? Try to install the latest version and see if it works then. Let me know otherwise!