Question

Counting Instances Of "No Overlap"

3

Entering edit mode

12.8 years ago

ES ▴ 30

Hello,

I have mapped all my reads to a reference transcriptome and I would like to use the BAM/SAM file to count the number of instances along each transcript where there are adjacent mapped reads that do not overlap (e.g., the ends of the reads just "touch", but do not overlap creating a "gap" in the map). In other words, I want to count the number of instances along each transcript not supported by overlapping reads . Is there already a tool to do this?

Thanks,

ES

mapping bowtie transcriptome bam sam • 4.3k views

ADD COMMENT • link updated 12.8 years ago by Michael 55k • written 12.8 years ago by ES ▴ 30

2

Entering edit mode

What are you trying to observe by doing this? Are you trying to see if the library fragmentation is biased?

ADD REPLY • link 12.8 years ago by Damian Kao 16k

0

Entering edit mode

Another alternative, if I understand your question right, is to use IRanges and obtain all those reads which only overlap with themselves. But, following Dk, what's the goal?

ADD REPLY • link 12.8 years ago by Arun 2.4k

0

Entering edit mode

Thanks for the feedback. I am using as a reference a de novo transcriptome that I assembled using oases and I am mapping back (with bowtie2) the original reads (used in the assembly process) as a way to get a handle on how well the assembly program is doing. Basically as a way to diagnose potential misassemblies. As I was visualizing the BAM file, I ran into this pattern, which seems to me very odd so I would like to count how many times does it occur in all the transcripts and all transcriptomes (I have more than 20). I think this number can give me a "rough" idea of potential assembly problems (or mapping problems, too)

ADD REPLY • link 12.8 years ago by ES ▴ 30

0

Entering edit mode

Could you please look at the answers again, and valid one.

ADD REPLY • link 12.6 years ago by Michael 55k

score 0 · Answer 1 · 2012-08-03

0

Entering edit mode

12.8 years ago

Leonor Palmeira 3.9k

What you can do (in Python, R, or your favourite language) is take all the positions of mapped reads and make a continuous signal (of the number of reads mapping at each position) along the length of the transcriptome, then just get the positions where your signal equals 0.

In R, for one transcript, it would give something like this:

data=read.table("yourmappedreads.sam")
names(data)=c("readname","flag","hit","pos","mapq","cigar","mrnm","mpos","tlen","seq","qual")

k="mytranscript" # the name of the transcript we'll be working on
L=length(read.fasta("allyourtranscripts.fasta")[[k]])
who=grep(k,data$hit)
plus=who[which(data$flag[who]==0 | data$flag[who]==256)] # let's focus on reads mapping on the '+' strain

signal=rep(0,L)
lapply(seq(length(plus)), function(i) {
  strain=1
  beg=data$pos[plus[i]]
  end=beg+length(s2c(data$seq[plus[i]])) # this is only true if all your CIGAR strings end with 'M', see here for details: <http://biostars.org/post/show/7126/what-are-the-most-common-stupid-mistakes-in-bioinformatics/#50229>
  signal[beg:end] <<- signal[beg:end]+1
})

which(signal==0) # the regions you'll be interested in

ADD COMMENT • link 12.8 years ago by Leonor Palmeira 3.9k

0

Entering edit mode

Thanks for your idea Leonor! This is a good lead; however, if I understand your code correctly, this does not count the instances I am interested because my problem is that I can have reads mapping at each an all positions, but there are cases where three contiguous positions are not supported by a read. I mean something like this:

=================== de novo reference transcript

XXXXXXXXXXYYYYYYYYY

ZZZZZZZZZZWWWWWW

So, there are reads mapping in all positions, the problem is that there are positions in the transcript where only ends and starts or reads meet (the regions where X and Y, and Z and W meet) but there are no reads that overlap (there are no reads "threading" through the position where X and Y, and Z and W just meet)

Thanks, ES

ADD REPLY • link 12.8 years ago by ES ▴ 30

0

Entering edit mode

If your BAM/SAM file is sorted by coordinate, any instance where the next read is mapped at the end of the current read + 1 will be the event you are looking for. If it is sorted, there will be no reads across that junction. I think Leonor's code does that.

ADD REPLY • link 12.8 years ago by Damian Kao 16k

0

Entering edit mode

Unfortunately, ES is right, the code is incorrect due to the reasons given above.

ADD REPLY • link 12.5 years ago by Michael 55k

0

Entering edit mode

Michael and ES are both right. In that case, one possibility would be to store the positions of ends and starts of reads (which is what ES is really interested in). I will take an example similar as the one above:

======    de novo reference transcript
UUUVVV        
XXXYYY         
ZZWWWW

This would lead to store the following counts of reads (starts or ends) for each of the 6 positions on the reference sequence:

Starts of reads:

Ends of reads:

We would then just have to detect adjacent peaks in the previous two vectors (peaks in the 'ends' signal that are followed by a peak in the 'starts' signal).

ADD REPLY • link 12.5 years ago by Leonor Palmeira 3.9k

score 0 · Answer 2 · 2012-08-03

Would this solve what you are looking for? This is a very simple example. But this can be extended over all chromosomal positions and read with RangesList.

# dummy data.frame
> df <- structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", 
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"), start = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 13L, 13L, 13L, 14L, 15L, 
17L, 18L, 19L, 20L, 20L), end = c(11L, 12L, 13L, 14L, 15L, 16L, 
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 28L, 29L, 30L, 30L, 30L, 
30L)), .Names = c("chr", "start", "end"), row.names = c(NA, -20L
), class = "data.frame")

What's in the dummy df?

> df
    chr start end
1  chr1     1  11
2  chr1     2  12
3  chr1     3  13
4  chr1     4  14
5  chr1     5  15
6  chr1     6  16
7  chr1     7  17
8  chr1     8  18
9  chr1     9  19
10 chr1    10  20
11 chr1    13  21
12 chr1    13  22
13 chr1    13  23
14 chr1    14  25
15 chr1    15  28
16 chr1    17  29
17 chr1    18  30
18 chr1    19  30
19 chr1    20  30
20 chr1    20  30

Rest of the code:

> require(IRanges)
> ir1 <- IRanges( start=df$start, end=df$end ) # create ranges
> ir2 <- IRanges( start=df$start, end=df$end+1 ) # create ranges with .+1 (@end position)

# find overlaps with itself and with . and .+1
> olaps1 <- findOverlaps( ir1, ir1 )
> olaps2 <- findOverlaps( ir1, ir2 )

# you are interested in those that occur in olaps2 but not in olaps1
> diff <- setdiff( olaps2, olaps1 )
# get data.frame
> df.olaps <- as.data.frame( diff )
> names(df.olaps) <- c("query", "subject")

# bind results
> res <- cbind( df[df.olaps$subject,2:3], df[df.olaps$query, 2:3] )
> res
        start end start end
2       2  12    13  21
2.1     2  12    13  22
2.2     2  12    13  23
3       3  13    14  25
4       4  14    15  28
6       6  16    17  29
7       7  17    18  30
8       8  18    19  30
9       9  19    20  30
9.1     9  19    20  30

Is this what you are trying for? No?

score 0 · Answer 3 · 2012-08-03

0

Entering edit mode

12.8 years ago

Jeremy Leipzig 23k

i'll give it a shot

library("Rsamtools")
library("GenomicRanges")

myreads<-readBamGappedAlignments("mybamfile.bam")
mycoverage<-coverage(myreads)

#find regions of depth 1
islands<-slice(mycoverage,1,1,rangesOnly=TRUE)
islands_gr<-GRanges(seqnames=names(unlist(islands)),ranges=unlist(islands))

#counts instances in which a read is stranded contained entirely within a depth-1 island
sum(countOverlaps(query=myreads,subject=islands_gr,type="within"))

#find reads that are contained entirely within these islands
overlaps<-findOverlaps(query=myreads,subject=islands_gr,type="within")
myreads[queryHits(overlaps)]

ADD COMMENT • link 12.8 years ago by Jeremy Leipzig 23k

0

Entering edit mode

Thanks for all the suggestions. I will give them a try!! the last 2 seem to do the trick!

ADD REPLY • link 12.8 years ago by ES ▴ 30

0

Entering edit mode

I was thinking about using depth 1 as well, but there could be multiple reads stacked on both sides of an "invalid juction" all not overlapping the junction, thus creating a gap with surrounding coverage >1.

ADD REPLY • link 12.8 years ago by Michael 55k

score 0 · Answer 4 · 2012-08-04

There is a pretty simple solution that doesn't require overlap calculation and uses only coverage. The advantage is that coverage can be computed in linear time (of reference sequence length if ranges are sorted) and coverage is a built-in function in IRanges. First an example how this works (using your example):

A: trying to detect 'gapsize' 0
=========^^======== reference with spurious junction ^^ coverage is 2 though at ^^
XXXXXXXXXXYYYYYYYYY reads X and Y
ZZZZZZZZZZWWWWWW reads Z and W
B: trying to detect gapsize 1:
=========^-======== reference with spurious junction, coverage at - is 0
XXXXXXXXXX-YYYYYYYYY reads X and Y
ZZZZZZZZZZ-WWWWWW reads Z and W
C: valid junction
=========^^======== reference probably correct junction ^^ 
XXXXXXXXXXYYYYYYYYY reads X and Y
    ZZZZZZZZZZ read Z covering junction

As one can see, case A is not detectable by using coverage but case B is, because the coverage is 0 in some point. Now, simple approach to detect case A without computing overlaps using only function gaps:

chop of 1 one both sides for each alignment, then look for regions where coverage=0 and length >=2; or check for length == 2, if only exactly event A is required; event B and C will not be affected by this.

=========^^======== reference with spurious junction ^^ coverage is 0 at ^^
-XXXXXXXX--YYYYYYY- reads X and Y  shortened
-ZZZZZZZZ--WWWWW- reads Z and W shortened

Code example, I am not at all caring about trivial stuff like reads of length 2, gaps at the beginning end end of reference sequences (subtract them, etc)

library("Rsamtools")
library("GenomicRanges")
# get example data:
example(readBamGappedAlignments)
[......]

gal2 # seems to be a god example
gr <- as(gal2,"GRanges")
gr.chopped <- resize(gr, fix="center", width=width(gr)-2) # resize the reads
mygaps <- gaps(gr.chopped) # get all the gaps
# show all gaps, in fact they could be some at the beginning of the reference sequence,   
mygaps[width(mygaps) == 2, ]

sum(width(mygaps) >= 2)
[1] 10
# seems like case A is not found,
# maybe you could make a little better example?
sum(width(mygaps) == 2) 
[1] 0

## get local coverage of the surrounding original reads for case A:
mygaps2 <- mygaps[width(mygaps) == 2,] # select gaps of length 2
mycoverage <- coverage(gr)
myView <- Views(mycoverage, as(mygaps2, "RangesList"))
myView$seq1
myView$seq2

Remarks:

After that, you can start thinking about the probability of observing event A: just by chance, given average sequence coverage.
Do you want to take larger gaps into account?
Does the probability of missing a valid read for a junction really depend on the local coverage?
Absence of evidence is not evidence of absence. Missing an alignment spanning a junction does not directly prove this juction is due to a misassembly.