Intermediate range calculation from files
8.2 years ago
User 6777 ▴ 20

Hi all,

I have started to learn Perl and Python, but I am completely stuck on this problem, so I am asking for your help.

I have seven files, each containing different number ranges. I want to compare the ranges and find those common to all files. Below is an example with three files (file1.txt, file2.txt and file3.txt):

file1.txt:

68476204: 9-50, 55-75, 80-132
NC_23987: 2-22, 1001-1085
68473073: 1-8
68485121: 1-10, 20-55

file2.txt:

68485121: 15-45
45905121: 2-98, 201-255
68476204: 8-30, 57-77, 88-180
NC_23987: 1-18, 1021-1055
68473073: 14-44

file3.txt:

68485121: 16-42
68476204: 8-22, 55-76, 81-118

From these, I want to generate two outputs. The first is the set of ranges common to all three files, after matching on the ID in the left column. For the input above, output1.txt would be:

68485121: 20-42
68476204: 9-22, 57-75, 88-118

The second output (output2.txt) contains only those common ranges whose length (end minus start) is >= 15. Here, output2.txt would be:

68485121: 20-42
68476204: 57-75, 88-118

Any suggestions are appreciated.

Thanks


How did you calculate output1.txt? Could you explain?

8.2 years ago

Convert your text files to BED files, sort them with sort-bed from BEDOPS, and run bedops --intersect on the sorted files to get the intervals common to all of them.

For example, file1.txt:

68476204: 9-50, 55-75, 80-132
NC_23987: 2-22, 1001-1085
68473073: 1-8
68485121: 1-10, 20-55

becomes file1.bed (when sorted):

68473073   1    8
68476204   9    50
68476204   55   75
68476204   80   132
68485121   1    10
68485121   20   55
NC_23987   2    22
NC_23987   1001 1085

And so on.

To convert from text to BED, you could use a Python script:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    # Split "id: a-b, c-d" into the ID and its comma-separated intervals
    (seq_id, intervals_str) = line.split(':', 1)
    for interval in intervals_str.replace(' ', '').split(','):
        (start, stop) = interval.split('-')
        sys.stdout.write('%s\t%s\t%s\n' % (seq_id, start, stop))

Then:

$ convert.py < file1.txt | sort-bed - > file1.bed

Etc.

Once you have BED files, you can do set operations on the sorted BED files:

$ bedops --intersect file1.bed file2.bed file3.bed > answer.bed
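
If installing BEDOPS is not an option (on Windows, say), the same intersection can be approximated in plain Python. What follows is only a minimal sketch, not how bedops works internally. The function names parse_ranges, intersect_two and intersect_all are made up for illustration, and the example lines are copied from the question. It assumes each range is a closed interval (both ends included), which may differ from the half-open BED convention when two ranges only touch at an endpoint.

```python
from collections import defaultdict


def parse_ranges(lines):
    """Parse lines like '68476204: 9-50, 55-75' into {id: [(start, stop), ...]}."""
    ranges = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        seq_id, intervals_str = line.split(':', 1)
        for interval in intervals_str.replace(' ', '').split(','):
            start, stop = interval.split('-')
            ranges[seq_id.strip()].append((int(start), int(stop)))
    return ranges


def intersect_two(a, b):
    """Pairwise overlap of two lists of closed intervals (assumption: ends inclusive)."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            start, stop = max(s1, s2), min(e1, e2)
            if start <= stop:
                out.append((start, stop))
    return sorted(out)


def intersect_all(tables):
    """Keep only IDs present in every table, with the ranges shared by all of them."""
    common = dict(tables[0])
    for table in tables[1:]:
        common = {seq_id: intersect_two(ivs, table[seq_id])
                  for seq_id, ivs in common.items() if seq_id in table}
    return {seq_id: ivs for seq_id, ivs in common.items() if ivs}


# Example input copied from the question; for real files, pass open(path) instead.
file1 = ["68476204: 9-50, 55-75, 80-132", "NC_23987: 2-22, 1001-1085",
         "68473073: 1-8", "68485121: 1-10, 20-55"]
file2 = ["68485121: 15-45", "45905121: 2-98, 201-255",
         "68476204: 8-30, 57-77, 88-180", "NC_23987: 1-18, 1021-1055",
         "68473073: 14-44"]
file3 = ["68485121: 16-42", "68476204: 8-22, 55-76, 81-118"]

common = intersect_all([parse_ranges(f) for f in (file1, file2, file3)])
for seq_id, intervals in common.items():
    print('%s: %s' % (seq_id, ', '.join('%d-%d' % iv for iv in intervals)))
```

For the three example files this reproduces the ranges listed in output1.txt. The pairwise loop is quadratic in the number of ranges per ID, which should be fine for small files; a sorted sweep would be a better fit for large inputs.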

Once you have the intersection of intervals, you can filter that result with awk based on interval length:

$ awk '($3-$2 >= 15)' answer.bed > filtered_answer.bed
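
The same length filter can be written in Python if awk is not available. Again, this is only a sketch: filter_by_length and format_ranges are hypothetical helper names, and the input dictionary is simply the output1.txt result from the question typed in by hand. Like the awk command, it measures length as stop minus start.

```python
def filter_by_length(common, min_length=15):
    """Keep intervals with stop - start >= min_length; drop IDs left with none."""
    filtered = {}
    for seq_id, intervals in common.items():
        kept = [(start, stop) for start, stop in intervals
                if stop - start >= min_length]
        if kept:
            filtered[seq_id] = kept
    return filtered


def format_ranges(table):
    """Render {id: [(start, stop), ...]} back into 'id: a-b, c-d' lines."""
    return ['%s: %s' % (seq_id, ', '.join('%d-%d' % iv for iv in intervals))
            for seq_id, intervals in table.items()]


# Common ranges from output1.txt in the question.
common = {'68485121': [(20, 42)],
          '68476204': [(9, 22), (57, 75), (88, 118)]}

for line in format_ranges(filter_by_length(common)):
    print(line)
```

With the question's data, the 9-22 range (length 13) is dropped and the remaining lines match output2.txt.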

Etc.


Thanks for your reply. However, as I am working on a Windows machine, I would prefer a Python/Perl alternative. Thank you.


No problem. Good luck.
