I have a text file with 4 columns: gi, evalue, start, stop.
There's several duplicates of each gi but they're not identical in coverage, so I want to extract from the text file only certain rows: those HSPs or hits (this a tblastn output) which have the greatest coverage for that gi.
HOWEVER, gi isn't enough of a determinant, because there is often more than one 'top' hsp per genome/gi. That is, there is more than one copy of the gene in a genome. Evalue doesn't seem to be enough a decider here in any scenario so column 2 can be ignored. Once the hits for a particular gi are distinguished based on their genome positions, I will simply take the hit which has the longest coverage (stop-start).
What I would like to do is in this intermediate step is, per gi entry (e.g. below), identify separate groups of hits/hsps based on their start/stop positions, or rather employ some sort of hierarchal clustering to set the rows in groups which could for instance be denoted in a 5th "ID" column, values 1,2,3 etc.
I don't yet know how to go about it (most of the clustering literature out there seems to go beyond what I need) but for instance, if the first row in a particular gi group was assigned ID 1, if the other rows in that group had start and stop values that differed by more than 600 nt say to those of this one, they would belong to a separate group (i.e. impose a cut-off 600). And so on, with hits not belonging to one used to nucleate other subgroup (2,3 etc.).
I started to employ numpy and generated a difference array for all start values with all other start values, and the same for stop, but the scripting to go through each row and call on this array was going to get very convoluted. I know scipy has some clustering algorithms but I can't tell if that's applicable to my problem.
If there is a simpler route you can think of please let me know.
Below is an extract of the data, where column gi is an extract for all values for that particular gi entry. In this entry there is an additional column showing sequence lengths.
99036121 0.0 1392057 1390123 1934
99036121 0.0 1392099 1390123 1976
99036121 0.0 1392111 1390123 1988
99036121 0.0 1392123 1390123 2000
99036121 0.0 1392123 1390543 1580
99036121 0.0 1730139 1728823 1316
99036121 0.0 1730139 1728829 1310
99036121 0.0 1768983 1767775 1208
99036121 1e-133 1768950 1767778 1172
99036121 1e-69 1768983 1768216 767
99036121 1e-77 1390509 1390123 386
99036121 1e-83 1768983 1767787 1196
99036121 2e-117 1768563 1767775 788
99036121 2e-58 1768983 1768279 704
99036121 3e-121 1768950 1767775 1175
99036121 3e-123 1768950 1767775 1175
99036121 3e-133 1768983 1767775 1208
99036121 3e-93 1768428 1767775 653
99036121 4e-101 1768950 1767775 1175
99036121 4e-135 1768983 1767775 1208
99036121 4e-136 1768983 1767775 1208
99036121 5e-133 1768983 1767787 1196
99036121 5e-136 1768983 1767775 1208
99036121 6e-112 1768542 1767775 767
99036121 6e-138 1768983 1767775 1208
99036121 7e-96 1768533 1767775 758
99036121 8e-136 1768983 1767775 1208
That's amazing, thank you. Your comment about start and end, its because this example is on the negative strand, will it be able to account for HSPs on the either strand? Is this is the install? http://www.bioconductor.org/packages/release/bioc/html/GenomicRanges.html
I ignored strand, but you can use strand in the constructor. You'll need to make sure that start is <= end, though, even in the case of reverse strand.
I see examples where you can manually define
strand=()
but how do you getIRanges
to look inz[,3]
orz[,4]
for start and end depending on which strand each row represents? I can introduce an extra field[,6]
of +/- based on the difference between[,3]
and[,4]
. Is it possible to put some 'if' statements in?Alternatively I can readjust my input to make sure
[,3]
is always less than[,4]
, put a another column in with +/- to preserve the strand data, and then swaps things back once I've got the output.Your latter approach is the way to go. I'd suggest a small function that takes your data.frame as read from disk and returns a
GRanges
object and a second function that takes aGRanges
object and returns a data.frame that conforms to whatever you need for output back to disk. For a technical detail, take a look at theifelse()
function.Thanks, I went with the latter approach and it did the job.