This is the same problem I recently had, but I was not able to find a good solution. I was finally able to work a method which I am sharing here. Specially for empirical biologists who know what to do but not what to do exactly this solution might be an opener. I hope this will be helpful for others.
Identification of the depth of coverage is quite useful in 1) identifying the regions that might have potential paralogous alignment 2) finding the coverage at your regions of interest.
Here is a step by step guide to the problem/question.
Step #1) First identify the depth at each locus from a bam file.
I have found samtools depth option more useful in this regard, when coverage at each locus is desired.
samtools depth deduped_MA605.bam > deduped_MA605.coverage
The output file 'deduped_MA605.coverage' file will have 3 columns (Chr#, position and depth at that position) like below.
Chr position depth (this header will be absent though)
1 3980 66
1 3981 68
1 3982 67
1 3983 67
1 3984 68
Step #2) Now, select the coverage (depth) by locus for each chromosome and/or regions
We can use the coverage file to plot it in R. But, the file is so large that it will suck up almost all the memory. It better to split the coverage by chromosome (or region of the chromosome if required).
To select the coverage for a particular chromosome (Chr#1 in my case)
awk '$1 == 1 {print $0}' deduped_MA605.coverage > chr1_MA605.coverage
To select coverage from chr #2
awk '$1 == 2 {print $0}' deduped_MA605.coverage > chr2_MA605.coverage
If the chrosomosome has string characters it can be adjusted as
awk '$1 == "chr2" {print $0}' deduped_MA605.coverage > chr2_MA605.coverage
Step #3) To plot the data in R this coverage file will need to be imported and the headers need to be added.
MA605.chr2 <- read.table("/media/everestial007/SeagateBackup4.0TB/New_Alignment_Set/05-B-deDupedReads-forScanIndel/chr2_MA605.coverage",
header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)`
Note: The header of the column are automatically set as V1, V2 and V3.
To rename the headers
library(reshape)
# loads the library to rename the column names
MA605.chr2<-rename(MA605.chr2,c(V1="Chr", V2="locus", V3="depth"))
# renames the header
Now, plot the coverage by depth:
plot(MA605.chr2$locus, MA605.chr2$depth)
to get the wider/cleaner view of the plot use
library(lattice, pos=10) xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16), scales=list(x=list(relation='same'), y=list(relation='same')), data=MA605.chr2, main="depth by locus - Chr2 (SampleMA605)")
The output plot looks like this:
Note:
We can see that there is a region around the middle and next to it that has a very high coverage. This likely suggests paralogous alignment.
Also, there is a gap in alignment. The region with no alignment is near (at) centromere where no consensus sequence has been found.
Reads may be later filtered by coverage.
I will post the details on how to filter by coverage in a week or more.
Thanks,
Hi Pierre, You script looks interesting. If you have some spare time, could you add some details about your script. Just requesting.
Thanks,