Would this solve what you are looking for? This is a very simple example. But this can be extended over all chromosomal positions and read with RangesList
.
# dummy data.frame
> df <- structure(list(chr = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1",
"chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"), start = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 13L, 13L, 13L, 14L, 15L,
17L, 18L, 19L, 20L, 20L), end = c(11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 25L, 28L, 29L, 30L, 30L, 30L,
30L)), .Names = c("chr", "start", "end"), row.names = c(NA, -20L
), class = "data.frame")
What's in the dummy df?
> df
chr start end
1 chr1 1 11
2 chr1 2 12
3 chr1 3 13
4 chr1 4 14
5 chr1 5 15
6 chr1 6 16
7 chr1 7 17
8 chr1 8 18
9 chr1 9 19
10 chr1 10 20
11 chr1 13 21
12 chr1 13 22
13 chr1 13 23
14 chr1 14 25
15 chr1 15 28
16 chr1 17 29
17 chr1 18 30
18 chr1 19 30
19 chr1 20 30
20 chr1 20 30
Rest of the code:
> require(IRanges)
> ir1 <- IRanges( start=df$start, end=df$end ) # create ranges
> ir2 <- IRanges( start=df$start, end=df$end+1 ) # create ranges with .+1 (@end position)
# find overlaps with itself and with . and .+1
> olaps1 <- findOverlaps( ir1, ir1 )
> olaps2 <- findOverlaps( ir1, ir2 )
# you are interested in those that occur in olaps2 but not in olaps1
> diff <- setdiff( olaps2, olaps1 )
# get data.frame
> df.olaps <- as.data.frame( diff )
> names(df.olaps) <- c("query", "subject")
# bind results
> res <- cbind( df[df.olaps$subject,2:3], df[df.olaps$query, 2:3] )
> res
start end start end
2 2 12 13 21
2.1 2 12 13 22
2.2 2 12 13 23
3 3 13 14 25
4 4 14 15 28
6 6 16 17 29
7 7 17 18 30
8 8 18 19 30
9 9 19 20 30
9.1 9 19 20 30
Is this what you are trying for? No?
What are you trying to observe by doing this? Are you trying to see if the library fragmentation is biased?
Another alternative, if I understand your question right, is to use IRanges and obtain all those reads which only overlap with themselves. But, following Dk, what's the goal?
Thanks for the feedback. I am using as a reference a de novo transcriptome that I assembled using oases and I am mapping back (with bowtie2) the original reads (used in the assembly process) as a way to get a handle on how well the assembly program is doing. Basically as a way to diagnose potential misassemblies. As I was visualizing the BAM file, I ran into this pattern, which seems to me very odd so I would like to count how many times does it occur in all the transcripts and all transcriptomes (I have more than 20). I think this number can give me a "rough" idea of potential assembly problems (or mapping problems, too)
Could you please look at the answers again, and valid one.