Simple clustering of start/stop nt positions in a text file column?
1
1
Entering edit mode
9.6 years ago

I have a text file with 4 columns: gi, evalue, start, stop.

There's several duplicates of each gi but they're not identical in coverage, so I want to extract from the text file only certain rows: those HSPs or hits (this a tblastn output) which have the greatest coverage for that gi.

HOWEVER, gi isn't enough of a determinant, because there is often more than one 'top' hsp per genome/gi. That is, there is more than one copy of the gene in a genome. Evalue doesn't seem to be enough a decider here in any scenario so column 2 can be ignored. Once the hits for a particular gi are distinguished based on their genome positions, I will simply take the hit which has the longest coverage (stop-start).

What I would like to do is in this intermediate step is, per gi entry (e.g. below), identify separate groups of hits/hsps based on their start/stop positions, or rather employ some sort of hierarchal clustering to set the rows in groups which could for instance be denoted in a 5th "ID" column, values 1,2,3 etc.

I don't yet know how to go about it (most of the clustering literature out there seems to go beyond what I need) but for instance, if the first row in a particular gi group was assigned ID 1, if the other rows in that group had start and stop values that differed by more than 600 nt say to those of this one, they would belong to a separate group (i.e. impose a cut-off 600). And so on, with hits not belonging to one used to nucleate other subgroup (2,3 etc.).

I started to employ numpy and generated a difference array for all start values with all other start values, and the same for stop, but the scripting to go through each row and call on this array was going to get very convoluted. I know scipy has some clustering algorithms but I can't tell if that's applicable to my problem.

If there is a simpler route you can think of please let me know.

Below is an extract of the data, where column gi is an extract for all values for that particular gi entry. In this entry there is an additional column showing sequence lengths.

99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
99036121    0.0    1392111    1390123 1988
99036121    0.0    1392123    1390123 2000
99036121    0.0    1392123    1390543 1580
99036121    0.0    1730139    1728823 1316
99036121    0.0    1730139    1728829 1310
99036121    0.0    1768983    1767775 1208
99036121    1e-133    1768950    1767778 1172
99036121    1e-69    1768983    1768216 767
99036121    1e-77    1390509    1390123 386
99036121    1e-83    1768983    1767787 1196
99036121    2e-117    1768563    1767775 788
99036121    2e-58    1768983    1768279 704
99036121    3e-121    1768950    1767775 1175
99036121    3e-123    1768950    1767775 1175
99036121    3e-133    1768983    1767775 1208
99036121    3e-93    1768428    1767775 653
99036121    4e-101    1768950    1767775 1175
99036121    4e-135    1768983    1767775 1208
99036121    4e-136    1768983    1767775 1208
99036121    5e-133    1768983    1767787 1196
99036121    5e-136    1768983    1767775 1208
99036121    6e-112    1768542    1767775 767
99036121    6e-138    1768983    1767775 1208
99036121    7e-96    1768533    1767775 758
99036121    8e-136    1768983    1767775 1208
clustering python blast Assembly • 2.7k views
ADD COMMENT
2
Entering edit mode
9.6 years ago

This is straightforward with Bioconductor GRanges. The process is roughly:

  • Load data for each GI
  • Create GRanges for each HSP
  • Use reduce() to collapse all overlapping hits into contiguous blocks

The return value, gr1, contains the three contiguous blocks of hits.

z = read.table(textConnection("99036121    0.0    1392057    1390123 1934
99036121    0.0    1392099    1390123 1976
                              99036121    0.0    1392111    1390123 1988
                              99036121    0.0    1392123    1390123 2000
                              99036121    0.0    1392123    1390543 1580
                              99036121    0.0    1730139    1728823 1316
                              99036121    0.0    1730139    1728829 1310
                              99036121    0.0    1768983    1767775 1208
                              99036121    1e-133    1768950    1767778 1172
                              99036121    1e-69    1768983    1768216 767
                              99036121    1e-77    1390509    1390123 386
                              99036121    1e-83    1768983    1767787 1196
                              99036121    2e-117    1768563    1767775 788
                              99036121    2e-58    1768983    1768279 704
                              99036121    3e-121    1768950    1767775 1175
                              99036121    3e-123    1768950    1767775 1175
                              99036121    3e-133    1768983    1767775 1208
                              99036121    3e-93    1768428    1767775 653
                              99036121    4e-101    1768950    1767775 1175
                              99036121    4e-135    1768983    1767775 1208
                              99036121    4e-136    1768983    1767775 1208
                              99036121    5e-133    1768983    1767787 1196
                              99036121    5e-136    1768983    1767775 1208
                              99036121    6e-112    1768542    1767775 767
                              99036121    6e-138    1768983    1767775 1208
                              99036121    7e-96    1768533    1767775 758
                              99036121    8e-136    1768983    1767775 1208
                              "),header=FALSE)
> library(GenomicRanges)
> # Note that start and end are switched in these example data
> # so you may need to be careful in the next line if this is 
> # not generally true.
> gr = GRanges(seqnames=z[,1],ranges=IRanges(start=z[,4],end=z[,3]))
> gr1 = reduce(gr)
> gr1
GRanges object with 3 ranges and 0 metadata columns:
      seqnames             ranges strand
         <Rle>          <IRanges>  <Rle>
  [1] 99036121 [1390123, 1392123]      *
  [2] 99036121 [1728823, 1730139]      *
  [3] 99036121 [1767775, 1768983]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD COMMENT
0
Entering edit mode

That's amazing, thank you. Your comment about start and end, its because this example is on the negative strand, will it be able to account for HSPs on the either strand? Is this is the install? http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html

ADD REPLY
1
Entering edit mode

I ignored strand, but you can use strand in the constructor. You'll need to make sure that start is <= end, though, even in the case of reverse strand.

ADD REPLY
0
Entering edit mode

I see examples where you can manually define strand=() but how do you get IRanges to look in z[,3] or z[,4] for start and end depending on which strand each row represents? I can introduce an extra field [,6] of +/- based on the difference between [,3] and [,4]. Is it possible to put some 'if' statements in?

Alternatively I can readjust my input to make sure [,3] is always less than [,4], put a another column in with +/- to preserve the strand data, and then swaps things back once I've got the output.

ADD REPLY
0
Entering edit mode

Your latter approach is the way to go. I'd suggest a small function that takes your data.frame as read from disk and returns a GRanges object and a second function that takes a GRanges object and returns a data.frame that conforms to whatever you need for output back to disk. For a technical detail, take a look at the ifelse() function.

ADD REPLY
0
Entering edit mode

Thanks, I went with the latter approach and it did the job.

ADD REPLY

Login before adding your answer.

Traffic: 1708 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6