Typically mapping output will show many reads that are very close to each other. Meaning the difference of mapping location differ within 10 positions.
Is there any existing tool to cluster such reads from SAM output and compute the frequency of reads mapped in the cluster?
Surely we can write an all-against-all read comparison, but it takes very longtime to run.
Perhaps there are faster tool around to do this task?
Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type?
Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type? How do you want to deal with paired ends - should both ends be included in the same cluster?
as the BAM file are indexed I would use the following java program to count the reads in a 20pb window around one read. A first loop, reads all the SamRecord, and a second loop reads the records around the first one.
edit: another idea, as the records are sorted you could also only use only one loop and put the records in a queue, removing the most distant records at each iteration.
If you want single linkage clustering, it should be easy to parse SAM files to generate this with a linear traversal, just make sure the SAM file is sorted first. If you want to take paired ends into account, it becomes slightly more complicated.
For completeness, here's a single linkage solution, assuming we ignore paired reads, and that the distance of ten is between beginning of reads. The input is a list of (read,position), trivially parseable from the SAM file. ∞ is maxint for whatever type positions are.
clusters = go [] ∞ input
go c p0 ((rd,pos):rest) | pos-10 < p0 = go (rd:c) pos rest -- add rd to the cluster
| otherwise = c : go [] ∞ ((rd,pos):rest) -- output cluster and continue
go _ _ [] = []
This is Haskell. 'clusters' calls the recursive helper function 'go', which takes a cluster accumulator 'c', a current position 'p0', and the input list of SAM records, the first element being a pair of a read 'rd' and its position 'pos'. When 'pos'-10 < 'p0', we put 'rd' on top of the current cluster and recurse, if not we return the current cluster, and start a new one on the remaining input. The last line terminates the recursion when input is empty. Clear? :-)
Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type?
Do you want non-redundant clusters (i.e. each read is in exactly one cluster), or (like Pierre's answer, I think) for each read all reads within 10 positions? And if the former, do you want transitive closure (single linkage), or some other type? How do you want to deal with paired ends - should both ends be included in the same cluster?