Question

How To Determine If There Is A Cluster Of Reads Mapping To A Different Chromosome

7

Entering edit mode

14.7 years ago

Allpowerde ★ 1.3k

Say you have sequence data from a targeted region, so all the reads are suppose to map on chr10 between 700k and 600k. However, if there were an insertion from another chrom/region in an individual compared to the reference some of the reads would map to a concise region other than the expected.

I hence want to know the maximal coverage achieved on any other region than chr10:700k-600k. To do that I would need the maximal consecutive span at an outside region and how many reads map there, to produce something like:

chr1:300000-300100 max coverage 3  average QUAL 20
chr1:200500-300700 max coverage 200 average QUAL 50
chr11:300000-300100 max coverage 10 average QUAL 25

Does anyone have such a script, e.g. in sampy? Or do you use structural variant detection programs for this, e.g. SVDetect?

sequencing bam • 3.3k views

ADD COMMENT • link updated 14.7 years ago by Michael 55k • written 14.7 years ago by Allpowerde ★ 1.3k

Ram · Answer 1 · 2010-08-19

I typically use the IRanges, GenomicRanges, and ShortReads packages in Bioconductor for things like these. They contain a very efficient set of high-level tools for dealing with intervals and coverage. have a look at these functions especially:

coverage(x) to compute a run length encoded vector of coverage for each position in x
reduce() condenses regions (overlapping reads are joined into 'normal' intervals)
restrict() to restrict a regions to genomic coordinates
Views() and slice() to find certain regions with high coverage

An example how easy this is, given your read positions for all chromosomes are in a GenomicRanges/RangedData object gr:

mycoverage = coverage ( gr[1] )

gives you the run length encoded coverage for each base on chromosome 1. Now, lets detect regions where the coverage is at least 1:

myviews = slice (mycoverage, lower=1)

And then, get the mean coverage and max coverage (just saw that u wanted max) for the regions with min coverage of 1 that we just found:

viewMeans(myviews)
viewMaxs(myviews)

the maximum coverage can also be computed for the whole chrom just like this:

max(coverage(gr[1]))
# or for all data combined:
max(coverage(gr))

And finally, the lengths of all the regions with min coverage of 1, of which we select the maximum:

which.max(viewApply(myviews, length))

I hope, this shows the principles and the concepts behind. If you need help getting the SAM files into R, just put a comment.

The Rsamtools package can also read BAM files.

Here is some r-code to simulate some toy-data to start with:

simulate.rangeddata <-  function (n, mean.length, genome.size, nchr=1){
    ir = IRanges(start=as.integer(runif(n, min=1, max=genome.size)), width=rpois(n, mean.length))
    rd = RangedData(ir , p.value = pt(rt(n,df=100), df=100), space=rep(paste("chr", 1:nchr, sep=""), length=n ), universe="Mygenome")
    # added some p-values even, but could be changed for e.g. quality score
    return(rd)
}

gr = simulate.rangeddata(n=10000, mean.length=100, genome.size=1E6, nchr=10)

score 1 · Answer 2 · 2010-08-19

I'm assuming that you've already mapped the reads using an aligner (something like BWA). I'd just use the mergeBed tool from BEDtools to condense the reads to regions. It's fast and easy to use.

Then it'll just be a matter of finding the longest one (stop - start). A line or two from your favorite scripting language will make that easy.

Note that this won't give you quality scores, but once you have the largest regions, you can go back and intersect with the original reads to figure out the average score, number of reads, coverage, etc.