Question

Python - Scan Thorugh A Large File And Report Matches In Vcf File Under Conditions

1

Entering edit mode

11.4 years ago

stolarek.ir ▴ 700

I was trying to scan with positions from one file through positions in second file to find, if the features are overlaping between them. file a looks like: (typical vcf entries. Many of them)

chr1    1161692 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63
chr1    249174066 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63
chr1    249175897 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63

I have a file Pt looking like this(tab delimited):

chr1    249174065 249174067
chr1    249175897    249175899

I wrote:

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break
            else:
                break

There is a problem now.

These both entries in Pt file should be a match. But the script only reports the first ontry from Pt file. If the first entry is not matched, the output is none. I want the script to output all matches

python vcf • 7.1k views

ADD COMMENT • link 11.4 years ago by stolarek.ir ▴ 700

1

Entering edit mode

Do you just want to extract subsets of the vcf file that are within certain regions? You could just use vcf-query

ADD REPLY • link 11.4 years ago by Damian Kao 16k

0

Entering edit mode

yes I do. but number of these regions is quite high. Can i pass the file with these regions to the vcf-query ?

ADD REPLY • link 11.4 years ago by stolarek.ir ▴ 700

score 1 · Answer 1 · 2013-07-17

1

Entering edit mode

11.4 years ago

Alex Reynolds 36k

While an expressive and beautiful language, Python can be a bit slow, especially at any work that involves I/O. A faster way to do this might be to convert to BED and use a dedicated set operation binary to quickly search for overlaps, e.g. given the file MySnps.vcf and the sorted BED file Pt.bed:

$ vcf2bed < MySnps.vcf \
    | bedops --element-of -1 - Pt.bed \
    > answer.bed

The file answer.bed will contain SNPs from the file MySnps.vcf that overlap regions in Pt.bed by one or more bases. (This overlap criterium can be adjusted, if needed.)

ADD COMMENT • link 11.3 years ago by Alex Reynolds 36k

0

Entering edit mode

this is a useful tools. It's always nice to practice code, but if I need a solution fast, that will probably be the best option for future

ADD REPLY • link 11.4 years ago by stolarek.ir ▴ 700

score 0 · Answer 2 · 2013-07-17

The issue as far as I can tell is the second " else: break: line.

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break
            else:
                break

Based on what you've written, the first time that the condition (line[0] == row[0]) fails, your script stops going through the elements of masterlist. I think you want to change you script to this:

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break

The second major issue is how you look to see if one interval is inside of another. You seem to be checking if the first coordinate of line, line[1] is in a list created from the coordinates of row, range(int(row[1]),int(row[2])) . It's quite inefficient to do it this way. Also, from what I see here w can only ever have one element, so why iterate over the elements of w. You should use tests to see if line[1] is more than or equal to row[1] but less than row[2].

Replace:

        w=[]
        if (line[0] == row[0]):
            f = range(int(row[1]),int(row[2]))
            w.append(line[1])
            for i in w:
                i = int(i)
                if i in f:
                    print line
                else:
                    break

with: if ((line[0] == row[0]) & (int(row[1]) <= line[1]) & (line[1] < int(row[2])) ): print line

score 0 · Answer 3 · 2013-07-17

0

Entering edit mode

11.4 years ago

stolarek.ir ▴ 700

masterlist = [row for row in Pt]

for line in a:
    line = line.strip().split()
    for row in masterlist:
        row = row.strip().split()
        b = int(line[1])
        f = range(int(row[1]),int(row[2]))
        if (line[0] == row[0]):
            if b in f:
                print line

The code above is the solution (turned out that i wasted couple of hours on optimizing the code, when the problem was with wrong file name [used a - apythonvcf, instead of a - abcpythonvcf]. But still, is there a way to make it go faster? Files are really huge.

Remember all, never give stupidly similar names to files

ADD COMMENT • link 11.4 years ago by stolarek.ir ▴ 700

1

Entering edit mode

Try this:

masterlist = [row for row in Pt]
for line in a:
  line = line.split() # No need to strip it. You are not doing any comparison on the last element.
  b = int(line[1]) # Moved this to the outer loop. There is no need to repeat this operation!
  for row in masterlist:
    row = row.strip().split()
    if (line[0] == row[0]):
      # There is no need to generate the ranges. Just do a comparison to figure out if b is in between
      if int(line[1]) <= b and b <= int(line[2]) 
        print line

That should be a bit faster.

ADD REPLY • link 11.4 years ago by David Westergaard ★ 1.5k

score 0 · Answer 4 · 2013-07-18

0

Entering edit mode

11.4 years ago

always_learning ★ 1.1k

You can used data structue here !! In similar cases I have used "Interval tree" data structure but in Perl. But I have seen similar implementation on Python also.

But here one thing you try also is make a "Hash of array" if its possible in Python. As I am not much aware about that.

ADD COMMENT • link 11.4 years ago by always_learning ★ 1.1k

score 0 · Answer 5 · 2013-07-18

0

Entering edit mode

11.4 years ago

Pierre Lindenbaum 164k

use the tabix bindings for python https://github.com/samtools/tabix/tree/master/python

ADD COMMENT • link 11.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

do you have some example that how you use tabix bindings for python for any such operations ?

ADD REPLY • link 10.2 years ago by always_learning ★ 1.1k

0

Entering edit mode

I'm not a python guy but there is an example: https://github.com/samtools/tabix/blob/master/python/test.py

ADD REPLY • link 10.2 years ago by Pierre Lindenbaum 164k

score 0 · Answer 6 · 2013-07-18

I will try using your suggestions and see if they are quicker than mine. For now I reduced the search space in the second file, which fasten the script much.

def round_down(num):
    return num - (num%100000)

vcf = open('/home/istolarek/HCC218vcf_intersect/a.vcf','r')
#vcf = open('/home/ptedder/chr3.vcf','r')
Pt = open('/home/istolarek/moleculoNA12878/nonPt','r')
#Pt = open('/home/ptedder/chr3.txt','r')
out = open('/home/istolarek/HCC218vcf_intersect/aPTout','w')
#platinum_region = [row for row in Pt]
platinum_region={}
platinum_region['chrM']={}
platinum_region['chrM'][0]=[]
ct=0
for region in Pt:
    (chr,start,end)= region.strip().split()
    start=int(start)
    end=int(end)
    rounded_start=round_down(start)

    if not (chr in platinum_region):
        platinum_region[chr]={}
    if not (rounded_start in platinum_region[chr]):
        platinum_region[chr][rounded_start]=[]
    platinum_region[chr][rounded_start].append({'start':start,'end':end})

c=0
for vcf_line in vcf:
    if (c % 1000 ==0):print "c ",c
    c=c+1
    vcf_data = vcf_line.strip().split()
    vcf_chrom=vcf_data[0]
    vcf_pos=int(vcf_data[1])
    rounded_vcf_position=round_down(vcf_pos)
    #print "vcfchrom ",vcf_chrom, " line ", vcf_line
    if vcf_chrom in platinum_region and rounded_vcf_position in platinum_region[vcf_chrom]:
        for region in platinum_region[vcf_chrom][rounded_vcf_position]:
            if vcf_pos in range(region['start'],region['end']):
##                print vcf_line
                out.write(str(vcf_line)+'\n')
out.close()