Print all rows that are in specified value range based on specific column as pairs
2
1
Entering edit mode
8.7 years ago
5heikki 11k

I made a related post to Stack Overflow the other day, but I figured that people here might e.g. know some bioinfo tools that do what I want..

So my data is in this kind of format (tab separated parsed from a sam file):

k_12_exc_170    272 FR667691.1  2646    ACATATAAGGTT
k_12_exc_132    256 FR667691.1  8100    AACATGGCGAGA
k_14_exc_1547   16  FR667691.1  8669    ATTATAAAGAATTT
k_13_exc_6130   272 FR667691.1  10710   GAGAAAGAACCAT
k_13_exc_6130   272 FR667691.1  10729   GAGAAAGAACCAT
k_13_exc_8289   16  FR667691.1  10827   TATTTATTGATGG
k_14_exc_19763  272 FR667691.1  15800   TTTAAAATGATTAC
k_15_exc_9848   16  FR667691.1  15800   TTTAAAATGATTACT
k_13_exc_3692   16  FR667691.1  16119   TGGACGAATAAGT
k_14_exc_2426   272 FR667691.1  16119   TGGACGAATAAGTT
k_12_exc_3570   272 FR667691.1  16436   GAACCATTCGAG

What I want to do, is to print all lines as pairs where the value in column 4 of the (to be) "right pair" is at least 100 larger than value in column 4 of the "left pair", but at the same time at most 200 larger. In other words, these lines define chromosomal regions, and I want to output each possible case where two lines are within the rules stated above. I can start this task from the sam file too. These files have ca. 100 k lines each and there can be thousands of them. They are sorted based on columns 3 and 4. Currently I'm doing pretty much what was suggested at Stack Overflow, except I first split each file into 1000 row files with additional 100 tail rows from the previously split file in head (so that many pairs would not be lost along the way). While this is considerably faster than whatever I was doing before, I can't help thinking that there should be some simpler and faster way to do it.. perhaps with bedtools or samtools or something like that?

With example data, only such row should be printed (but I can deal with other formats too but this info should be available):

 k_13_exc_6130  272 FR667691.1  10710   GAGAAAGAACCAT    k_13_exc_8289  16  FR667691.1  10827   TATTTATTGATGG

So in essence, I want to do conditional join..

awk • 4.7k views
ADD COMMENT
3
Entering edit mode
8.7 years ago
kloetzl ★ 1.1k

Here is a quick solution in C. Not beautiful but it should give you an idea of how an optimal solution might look like.

#include <stdio.h>
#include <stdlib.h>

const long LOWER = 100;
const long UPPER = 200;
const long MAX_LINES = 1000000;

struct line {
    char *foo, *bar, *  baz;
    long derp, herp;
};


int main()
{
    struct line *lines = malloc(MAX_LINES * sizeof(*lines));
    struct line *fillptr, *left, *right_upper, *right_lower;
    fillptr = left = right_upper = right_lower = lines;

    while ( 5 == scanf("%ms %li %ms %li %ms\n", &fillptr->foo, &fillptr->derp, &fillptr->bar, &fillptr->herp, &fillptr->baz)) {
        fillptr++;
    }

    // assumes lines are sorted by `herp`.
    while ( left < fillptr) {
        while (right_lower->herp < left->herp + LOWER && right_lower < fillptr) {
            right_lower++;
        }
        while (right_upper->herp < left->herp + UPPER && right_upper < fillptr) {
            right_upper++;
        }
        struct line *ptr = right_lower;
        while (ptr < right_upper) {
            printf("%s\t%li\t%s\t%li\t%s", left->foo, left->derp, left->bar, left->herp, left->baz);
            printf("%s\t%li\t%s\t%li\t%s\n", ptr->foo, ptr->derp, ptr->bar, ptr->herp, ptr->baz);
            ptr++;
        }
        free(left->foo);
        free(left->bar);
        free(left->baz);
        left++;
    }

    free(lines);
    return 0;
}

Untested except for that tiny example. ☺

ADD COMMENT
0
Entering edit mode

Wow. What an incredibly complex answer. Certainly must be efficient though. Perhaps assembly language would be even faster?

For me, this would be the perfect task for R.

If you're looking for absolute maximal efficiency, try the data.table package in R. fread is very very, fast. Personally, I'm used to working with the traditional data.frames so I give the argument data.table=FALSE, even at the slight cost of a loss in efficiency. I use dplyr for the filtering, which is also very fast.

ADD REPLY
0
Entering edit mode

What an incredibly complex answer. Certainly must be efficient though.

It is not complex at all. It just looks like that if you are not used to pointers. The efficiency is in the algorithm: I basically reuse a window of candidates bounded by right_lower and right_upper to quickly determine the pairs for the next iteration. As the input is sorted, the algorithm runs in time O(output).

Perhaps assembly language would be even faster? […] If you're looking for absolute maximal efficiency, try the data.table package in R. fread is very very, fast.

Use the right tool, for the right job. In this case I needed a programming language able to store structures (or objects, whatever) in an array. As AWK lacks that feature, I switched to the next best language at hand. This is propably a one-shot program, with little reuse value. Thus the need for performance is actually small.

ADD REPLY
0
Entering edit mode

I'm sure it's a good answer. You're right. It's just incomprehensible to me. I'll have to give points to the Perl answer below, but I will have to study C more to understand your code better.

The way the question is asked, it's very difficult to come up with an efficient algorithm anyway, regardless of the language used.

ADD REPLY
0
Entering edit mode

Hi, thanks a lot for your effort. Unfortunately, I really suck at C. I suppose if I move LOWER, UPPER and MAXLINES to main, I can pass on their values as args, like:

int main(int argc, char *argv[])
{
    const long LOWER = atoi(argv[1]);
    const long UPPER = atoi(argv[2]);
    const long MAX_LINES = 1000000;

..

At least it compiled without problems. However, I still did not figure out how to pass a file to the program (or read from STDIN).

ADD REPLY
0
Entering edit mode

Use - as third param to read from stdin. Updated version:

ADD REPLY
0
Entering edit mode

Compiled without problems but does not output anything with the test data. Perhaps it's because I'm on OS X right now, and Google tells me that OS X fscanf() is not POSIX 2008 compliant. Also tried to compile with -std=c99. I will try it with CentOS tomorrow. Also, as these coordinates will never reach 100s of millions (data from prokaryotes only), is it really necessary to use long instead of int? Could such thing affect performance noticeably? I know the definition of int in C is a little vague, but I would think that all modern systems use 4 bytes instead of 2 for int..

gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.3.0 (clang-703.0.29)
Target: x86_64-apple-darwin15.0.0
Thread model: posix
ADD REPLY
0
Entering edit mode

The OS X is not the problem. Maybe your data is slightly differently formatted than assumed by line 24. Unfortunately that is something only you can debug. If you add fprintf(stderr,"%zu",fillptr-lines); in line 27 and the output is NOT the number of lines in the input it will give you the problematic line (or some location near it).

ADD REPLY
0
Entering edit mode

No output. I'm almost certain it's because of OS X. BSD fscanf does not know about %m.

http://linux.die.net/man/3/fscanf POSIX 2008

vs

http://www.manpagez.com/man/3/fscanf/ ISO C99

ADD REPLY
0
Entering edit mode

You could try adding --std=gnu99. Or link to a different libc. Though I guess it would be easier to just run it on Linux. Dammit, Apple. YUNO%ms‽

ADD REPLY
0
Entering edit mode

So no problems compiling on Linux.

I wonder though, why this works:

printf("%s\t%li\t%s\t%li\t%s\n", ptr->foo, ptr->derp, ptr->bar, ptr->herp+strlen(ptr->baz)-1, ptr->baz);

While this causes segmentation fault during run:

while (right_upper->herp < left->herp + UPPER + strlen(right_upper->baz)-1 && right_upper < fillptr)
ADD REPLY
0
Entering edit mode

Swap the two parts of the condition. By right_upper < fillptr you ensure that right_upper still points to a valid region. Only if that is true, you can follow right_upper->baz. My code above also contains that bug, but I guess, I just was lucky. ☺

ADD REPLY
0
Entering edit mode

Thanks.I think my first attempt took ca. 40 min per file, while the Stack Overflow inspired method dropped it to ca. 8 min. Now 1) splitting input file on column 3 with awk, 2) processing and deleting all split files, and 3) combining output to a new file takes altogether ca. 6 seconds. I can definitely live with this. Also, implementing the 3rd column check to your code will be nice C practice for me. You and Jorge Amigo both provided great answers. I can only accept one. You answered first, so congrats :) Maybe later when I have benchmarked all approaches properly I will change first post.

ADD REPLY
0
Entering edit mode

Oh and in case I will accidentally delete source or something:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct line {
  char *seqName, *chrName, *kmer;
  long bitFlag, position;
};

int main(int argc, char const *argv[])
{
  long LOWER = atol(argv[1]);
  long UPPER = atol(argv[2]);
  long MAX_LINES = 1000000;
  FILE* input = stdin;
  if (argc == 4 && strcmp(argv[3],"-") != 0) {
    input = fopen(argv[3], "r");
  }

  struct line *lines = malloc(MAX_LINES * sizeof(*lines));
  struct line *fillptr, *left, *right_upper, *right_lower;
  fillptr = left = right_upper = right_lower = lines;

  while ( 5 == fscanf( input, "%ms %li %ms %li %ms\n", &fillptr->seqName, &fillptr->bitFlag, &fillptr->chrName, &fillptr->position, &fillptr->kmer)) {
    fillptr++;
  }

  // assumes lines are sorted by `position`.
  // assumes all values in column 3 of input data are the same
  while (left < fillptr) {
    while (right_lower < fillptr && right_lower->position < left->position + LOWER - strlen(right_lower->kmer) + 1) {
      right_lower++;
    }
    while (right_upper < fillptr && right_upper->position < left->position + UPPER - strlen(right_upper->kmer) + 1) {
      right_upper++;
    }
    struct line *ptr = right_lower;
    while (ptr < right_upper) {
      printf("%s\t%li\t%s\t%li\t%s\t", left->seqName, left->bitFlag, left->chrName, left->position, left->kmer);
      printf("%s\t%li\t%s\t%li\t%s\n", ptr->seqName, ptr->bitFlag, ptr->chrName, ptr->position + strlen(ptr->kmer) - 1, ptr->kmer);
      ptr++;
    }
    free(left->seqName);
    free(left->chrName);
    free(left->kmer);
    left++;
  }
  free(lines);
  fclose(input);
  return 0;
}
ADD REPLY
0
Entering edit mode

AWK is great so so much, but some times it pays off to write 50 lines of C. ☺

ADD REPLY
0
Entering edit mode

So I have this problem with the code. When my input data includes 11mer and 12mers, the length of output regions varies between 101 and 200 bp. Then when my input data includes 11mers, 12mers and 13mers, the length of output regions varies between 100 and 200 bp. Then when input data includes 11mers, 12mers, 13mers and 14mers, the length of output regions varies between 99 and 200 bp. As far as I can tell, this rule holds true always, so with 11-15mers, I get 98-200 bp regions, with 11-16mers, 97-200 bp, etc. I thought that perhaps this was somehow related to the strlen stuff, so I even added a 6th column to input data which is "col 4 + lenght col 5 - 1". Then I changed the code as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct line {
  char *seqName, *chrName, *kmer;
  long bitFlag, start, end;
};

int main(int argc, char const *argv[])
{
  long LOWER = atol(argv[1]);
  long UPPER = atol(argv[2]);
  long MAX_LINES = 10000000;
  FILE* input = stdin;
  if (argc == 4 && strcmp(argv[3],"-") != 0) {
    input = fopen(argv[3], "r");
  }

  struct line *lines = malloc(MAX_LINES * sizeof(*lines));
  struct line *fillptr, *left, *right_upper, *right_lower;
  fillptr = left = right_upper = right_lower = lines;

  while ( 6 == fscanf( input, "%ms %li %ms %li %ms %li\n", &fillptr->seqName, &fillptr->bitFlag, &fillptr->chrName, &fillptr->start, &fillptr->kmer, &fillptr->end)) {
    fillptr++;
  }

  // assumes lines are sorted by start
  // assumes all values in column 3 of input data are the same
  while (left < fillptr) {
    while (right_lower < fillptr && right_lower->end < left->start + LOWER) {
      right_lower++;
    }
    while (right_upper < fillptr && right_upper->end < left->start + UPPER) {
      right_upper++;
    }
    struct line *ptr = right_lower;
    while (ptr < right_upper) {
      printf("%s\t%li\t%s\t%li\t%s\t", left->seqName, left->bitFlag, left->chrName, left->start, left->kmer);
      printf("%s\t%li\t%s\t%li\t%s\n", ptr->seqName, ptr->bitFlag, ptr->chrName, ptr->end, ptr->kmer);
      ptr++;
    }
    free(left->seqName);
    free(left->chrName);
    free(left->kmer);
    left++;
  }
  free(lines);
  fclose(input);
  return 0;
}

The coordinates I get correspond to the kmers that are supposed to be there perfectly, however, the length problem persists and I have absolutely no clue of what is causing it..

ADD REPLY
0
Entering edit mode

I have trouble reproducing your problem: The data given in the start post has 5 columns, whereas your code expects a sixth. (end coordinate)

ADD REPLY
0
Entering edit mode

I just realized that it may be a sorting issue. If not, I will post some more example data..

ADD REPLY
0
Entering edit mode

Well, I could not fix it. Went back to five column input. For example with this data:

k_12_exc_3093   0   CP002   1183936 CGACATATCATC
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC
k_16_exc_7120   0   CP002   1183959 ACCAGTGTCTGATGTA
k_15_exc_16357  0   CP002   1183960 CCAGTGTCTGATGTA
k_15_exc_12020  0   CP002   1184023 ATGAAAGGACCGTTC
k_16_exc_12332  0   CP002   1184023 ATGAAAGGACCGTTCA
k_11_exc_451    16  CP002   1184024 TGAAAGGACCG
k_12_exc_959    16  CP002   1184024 TGAAAGGACCGT
k_13_exc_1025   16  CP002   1184024 TGAAAGGACCGTT

Expected output is (added column 11 to display length):

k_12_exc_3093   0   CP002   1183936 CGACATATCATC    k_15_exc_12020  0   CP002   1184037 ATGAAAGGACCGTTC 101
k_12_exc_3093   0   CP002   1183936 CGACATATCATC    k_16_exc_12332  0   CP002   1184038 ATGAAAGGACCGTTCA    102
k_12_exc_3093   0   CP002   1183936 CGACATATCATC    k_12_exc_959    16  CP002   1184035 TGAAAGGACCGT    99
k_12_exc_3093   0   CP002   1183936 CGACATATCATC    k_13_exc_1025   16  CP002   1184036 TGAAAGGACCGTT   100
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG   k_15_exc_12020  0   CP002   1184037 ATGAAAGGACCGTTC 101
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG   k_16_exc_12332  0   CP002   1184038 ATGAAAGGACCGTTCA    102
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG   k_12_exc_959    16  CP002   1184035 TGAAAGGACCGT    99
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG   k_13_exc_1025   16  CP002   1184036 TGAAAGGACCGTT   100
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC  k_15_exc_12020  0   CP002   1184037 ATGAAAGGACCGTTC 101
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC  k_16_exc_12332  0   CP002   1184038 ATGAAAGGACCGTTCA    102
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC  k_12_exc_959    16  CP002   1184035 TGAAAGGACCGT    99
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC  k_13_exc_1025   16  CP002   1184036 TGAAAGGACCGTT   100

However, in addition to those, I get (again added length for clarity):

k_12_exc_3093   0   CP002   1183936 CGACATATCATC    k_11_exc_451    16  CP002   1184034 TGAAAGGACCG 98
k_13_exc_8931   0   CP002   1183936 CGACATATCATCG   k_11_exc_451    16  CP002   1184034 TGAAAGGACCG 98
k_14_exc_14624  0   CP002   1183936 CGACATATCATCGC  k_11_exc_451    16  CP002   1184034 TGAAAGGACCG 98

edit. I fixed it by making the printf stuff conditional..

ADD REPLY
2
Entering edit mode
8.7 years ago

this quick perl idea stores all chromosome positions in memory (considering that the data is sorted by position it could/should be optimized if accessing a large sam file) and then loops through them all looking for the existence of positions >=100bp and <=200bp on the same chromosome as required:

perl -lane '{
$data{$F[2]}{$F[3]}{$_} = 1
} END {
$top = 200;
$low = 100;
foreach $chr (keys %data) {
 foreach $pos (keys %{$data{$chr}}) {
  foreach $tempPos ($pos+$low..$pos+$top) {
if (exists $data{$chr}{$tempPos}) {
 foreach $line1 (keys %{$data{$chr}{$pos}}) {
  foreach $line2 (keys %{$data{$chr}{$tempPos}}) {
    print "$line1\t$line2\n"
}}}}}}}' test.txt

note that this code will print all possible sequence combinations that match that position difference requirement stated, so you may find more than 1 possible pair for a particular sequence, or even the same pair for several sequences. maybe some more constraints should be added rather than only looking to chromosome positions.

EDIT: assuming that the input file is sorted by chromosome and by position, a much faster and memory efficient way of doing it (you could use it with any sam file, independently of its size) would be this:

perl -lane '{
$chr = $F[2];
$pos = $F[3];
$posTop = $pos - 100;
$posLow = $pos - 200;
if ($chr ne $prevChr) {%chrData = ()}
$chrData{$pos}{$_} = 1;
foreach $tempPos (sort {$a<=>$b} keys %chrData) {
 if ($tempPos < $posLow) {delete $chrData{$tempPos}}
 elsif ($tempPos <= $posTop) {
 foreach $line1 (keys %{$chrData{$tempPos}}) {
  foreach $line2 (keys %{$chrData{$pos}}) {
    print "$line1\t$line2\n"
}}}}
$prevChr = $chr;
}' test.txt
ADD COMMENT
0
Entering edit mode

Thanks, this one worked. I usually avoid perl because I absolutely despise hunting down dependencies of some old perl scrips (horrible memories from for example HAL). Thankfully, there are none here. Hopefully kloetzl's C works on my CentOS box so I can do some benchmarking tomorrow.

ADD REPLY
0
Entering edit mode

C is usually faster than Perl but, due to its simplicity, I'm almost certain that the second piece of code is faster than the C solution presented here.

ADD REPLY
0
Entering edit mode

How should I modify the second piece if I want to account for lenght($F[4]) (the 5th column) contributing to the length interval in case of the "right pair". I could of course first add end coordinates with e.g. awk and then modify the script slightly, but it would be nice if I could skip that..

ADD REPLY
0
Entering edit mode

you mean that you'd like to add the length of the sequence to the length of the interval, so that the interval would be calculated from the beginning of the first sequence to the end of the second sequence? if that's the case you'll only have to change the definition of the limits. but there's a problem with this you may have not considered. if the length of the sequences is to be considered, then the position sort wouldn't be enough to confidently use this second code. you would need to remove the delete $chrData{$tempPos} line, which will make the whole script slower. or else you could consider an evolution of the first one.

the entire code should look like this:

perl -lane '{
$chr = $F[2];
$pos = $F[3];
$seqLength = length($F[4]);
$posTop = $pos + $seqLength - 100;
$posLow = $pos + $seqLength - 200;
if ($chr ne $prevChr) {%chrData = ()}
$chrData{$pos}{$_} = 1;
foreach $tempPos (keys %chrData) {
 if ($tempPos < $posLow and $tempPos <= $posTop) {
 foreach $line1 (keys %{$chrData{$tempPos}}) {
  foreach $line2 (keys %{$chrData{$pos}}) {
    print "$line1\t$line2\n"
}}}}
$prevChr = $chr;
}' test.txt
ADD REPLY
0
Entering edit mode

I have to say it looks a lot better than what I tried. But as you said, it's slow. Too slow. In have no idea why in my sh###y attempt it had to be "$kmerLen-2" for $posTop so the coordinates yielded correct seqs (from the small sample I tried anyway). P.s. I didn't specify in 1st post the complete problem because building on top of "skeleton code" provides great learning opportunity and forces me to understand the code atleast a little bit

export MINL=100;
export MAXL=200;.
perl -lane '{
$chr=$F[2];
$pos=$F[3];
$kmerLen=length($F[4]);
$posTop=$pos-$ENV{'MINL'}+$kmerLen-2;
$posLow=$pos-$ENV{'MAXL'}+$kmerLen;
if ($chr ne $prevChr){%chrData=()}
$chrData{$pos}{$_}=1;
foreach $tempPos (sort {$a<=>$b} keys %chrData)
{if($tempPos < $posLow){delete $chrData{$tempPos}}
elsif($tempPos <= $posTop)
{foreach $line1(keys %{$chrData{$tempPos}})
{foreach $line2 (keys %{$chrData{$pos}})
{print "$line1\t$line2"}}}}$prevChr = $chr;}' test.txt | awk 'BEGIN{OFS=FS="\t"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9+length($10)-1,$10}'
ADD REPLY
0
Entering edit mode

what if you care about sequence length only for the intervals, but not for the positions removal? it only requires adding a new check inside the main foreach loop, and it should make the whole code as fast as before:

perl -lane 'BEGIN{$minLength = 100-1; $maxLength = 200-1} {
$chr = $F[2];
$pos = $F[3];
$seqLength = length($F[4]);
$posMin = $pos - $maxLength;
$posLow = $pos + $seqLength - $maxLength;
$posTop = $pos + $seqLength - $minLength;
if ($chr ne $prevChr) {%chrData = ()}
$chrData{$pos}{$_} = 1;
foreach $tempPos (sort {$a<=>$b} keys %chrData) {
 if ($tempPos < $posMin) {delete $chrData{$tempPos}}
 elsif ($tempPos >= $posLow and $tempPos <= $posTop) {
 foreach $line1 (keys %{$chrData{$tempPos}}) {
  foreach $line2 (keys %{$chrData{$pos}}) {
   if ($line2 =~ /(.+)\t(\d+)\t(\S+)/) {$line2 = $1."\t".($2+length($3)-1)."\t".$3}
   print "$line1\t$line2\n"
}}}}
$prevChr = $chr;
}' test.txt

regarding your code:

  • exporting and retrieving $ENV{'MINL'} and $ENV{'MAXL'} for every line takes longer than defining them in a BEGIN section. I have included in this last modification.
  • if the output is not exactly what you need, you could consider integrating that awk inside the perl. I have included in this last modification too.
  • having to remove -2 from $posTop has only to do with what interval exactly you need to detect. you can specify it in the variable definitions ($posTop and $posLow) and/or in the interval check ($tempPos >= $posLow and $tempPos <= $posTop). in this last modification, just as an idea, I did something on the variable definitions, but instead of having to remove -1 twice on each line I preferred to do it only once at the BEGIN section.
  • not specifying the complete problem is usually not a good idea, but the reason why you did it deserves my congratulations. I hope you understand how this code goes and that you're able to play around with it so that you can finally solve your problems.
ADD REPLY

Login before adding your answer.

Traffic: 2111 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6