I typically use the IRanges, GenomicRanges, and ShortReads packages in Bioconductor for things like these. They contain a very efficient set of high-level tools for dealing with intervals and coverage. have a look at these functions especially:
coverage(x)
to compute a run length encoded vector of coverage for each position in x
reduce()
condenses regions (overlapping reads are joined into 'normal' intervals)
restrict()
to restrict a regions to genomic coordinates
Views()
and slice()
to find certain regions with high coverage
An example how easy this is, given your read positions for all chromosomes are in a GenomicRanges/RangedData object gr
:
mycoverage = coverage ( gr[1] )
gives you the run length encoded coverage for each base on chromosome 1. Now, lets detect regions where the coverage is at least 1:
myviews = slice (mycoverage, lower=1)
And then, get the mean coverage and max coverage (just saw that u wanted max) for the regions with min coverage of 1 that we just found:
viewMeans(myviews)
viewMaxs(myviews)
the maximum coverage can also be computed for the whole chrom just like this:
max(coverage(gr[1]))
# or for all data combined:
max(coverage(gr))
And finally, the lengths of all the regions with min coverage of 1, of which we select the maximum:
which.max(viewApply(myviews, length))
I hope, this shows the principles and the concepts behind. If you need help getting the SAM files into R, just put a comment.
The Rsamtools package can also read BAM files.
Here is some r-code to simulate some toy-data to start with:
simulate.rangeddata <- function (n, mean.length, genome.size, nchr=1){
ir = IRanges(start=as.integer(runif(n, min=1, max=genome.size)), width=rpois(n, mean.length))
rd = RangedData(ir , p.value = pt(rt(n,df=100), df=100), space=rep(paste("chr", 1:nchr, sep=""), length=n ), universe="Mygenome")
# added some p-values even, but could be changed for e.g. quality score
return(rd)
}
gr = simulate.rangeddata(n=10000, mean.length=100, genome.size=1E6, nchr=10)