Question

Simple clustering of start/stop nt positions in a text file column?

1

Entering edit mode

9.6 years ago

oliver.bayfield ▴ 210

I have a text file with 4 columns: gi, evalue, start, stop.

There's several duplicates of each gi but they're not identical in coverage, so I want to extract from the text file only certain rows: those HSPs or hits (this a tblastn output) which have the greatest coverage for that gi.

HOWEVER, gi isn't enough of a determinant, because there is often more than one 'top' hsp per genome/gi. That is, there is more than one copy of the gene in a genome. Evalue doesn't seem to be enough a decider here in any scenario so column 2 can be ignored. Once the hits for a particular gi are distinguished based on their genome positions, I will simply take the hit which has the longest coverage (stop-start).

What I would like to do is in this intermediate step is, per gi entry (e.g. below), identify separate groups of hits/hsps based on their start/stop positions, or rather employ some sort of hierarchal clustering to set the rows in groups which could for instance be denoted in a 5th "ID" column, values 1,2,3 etc.

I don't yet know how to go about it (most of the clustering literature out there seems to go beyond what I need) but for instance, if the first row in a particular gi group was assigned ID 1, if the other rows in that group had start and stop values that differed by more than 600 nt say to those of this one, they would belong to a separate group (i.e. impose a cut-off 600). And so on, with hits not belonging to one used to nucleate other subgroup (2,3 etc.).

I started to employ numpy and generated a difference array for all start values with all other start values, and the same for stop, but the scripting to go through each row and call on this array was going to get very convoluted. I know scipy has some clustering algorithms but I can't tell if that's applicable to my problem.

If there is a simpler route you can think of please let me know.

Below is an extract of the data, where column gi is an extract for all values for that particular gi entry. In this entry there is an additional column showing sequence lengths.

99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
99036121    0.0    1392111    1390123 1988
99036121    0.0    1392123    1390123 2000
99036121    0.0    1392123    1390543 1580
99036121    0.0    1730139    1728823 1316
99036121    0.0    1730139    1728829 1310
99036121    0.0    1768983    1767775 1208
99036121    1e-133    1768950    1767778 1172
99036121    1e-69    1768983    1768216 767
99036121    1e-77    1390509    1390123 386
99036121    1e-83    1768983    1767787 1196
99036121    2e-117    1768563    1767775 788
99036121    2e-58    1768983    1768279 704
99036121    3e-121    1768950    1767775 1175
99036121    3e-123    1768950    1767775 1175
99036121    3e-133    1768983    1767775 1208
99036121    3e-93    1768428    1767775 653
99036121    4e-101    1768950    1767775 1175
99036121    4e-135    1768983    1767775 1208
99036121    4e-136    1768983    1767775 1208
99036121    5e-133    1768983    1767787 1196
99036121    5e-136    1768983    1767775 1208
99036121    6e-112    1768542    1767775 767
99036121    6e-138    1768983    1767775 1208
99036121    7e-96    1768533    1767775 758
99036121    8e-136    1768983    1767775 1208

clustering python blast Assembly • 2.7k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 9.6 years ago by oliver.bayfield ▴ 210

Ram · Accepted Answer · 2015-05-19

This is straightforward with Bioconductor GRanges. The process is roughly:

Load data for each GI
Create GRanges for each HSP
Use reduce() to collapse all overlapping hits into contiguous blocks

The return value, gr1, contains the three contiguous blocks of hits.

z = read.table(textConnection("99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
                              99036121    0.0    1392111    1390123 1988
                              99036121    0.0    1392123    1390123 2000
                              99036121    0.0    1392123    1390543 1580
                              99036121    0.0    1730139    1728823 1316
                              99036121    0.0    1730139    1728829 1310
                              99036121    0.0    1768983    1767775 1208
                              99036121    1e-133    1768950    1767778 1172
                              99036121    1e-69    1768983    1768216 767
                              99036121    1e-77    1390509    1390123 386
                              99036121    1e-83    1768983    1767787 1196
                              99036121    2e-117    1768563    1767775 788
                              99036121    2e-58    1768983    1768279 704
                              99036121    3e-121    1768950    1767775 1175
                              99036121    3e-123    1768950    1767775 1175
                              99036121    3e-133    1768983    1767775 1208
                              99036121    3e-93    1768428    1767775 653
                              99036121    4e-101    1768950    1767775 1175
                              99036121    4e-135    1768983    1767775 1208
                              99036121    4e-136    1768983    1767775 1208
                              99036121    5e-133    1768983    1767787 1196
                              99036121    5e-136    1768983    1767775 1208
                              99036121    6e-112    1768542    1767775 767
                              99036121    6e-138    1768983    1767775 1208
                              99036121    7e-96    1768533    1767775 758
                              99036121    8e-136    1768983    1767775 1208
                              "),header=FALSE)
> library(GenomicRanges)
> # Note that start and end are switched in these example data
> # so you may need to be careful in the next line if this is 
> # not generally true.
> gr = GRanges(seqnames=z[,1],ranges=IRanges(start=z[,4],end=z[,3]))
> gr1 = reduce(gr)
> gr1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1] 99036121 [1390123, 1392123]      *
  [2] 99036121 [1728823, 1730139]      *
  [3] 99036121 [1767775, 1768983]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths