Python - Scan Thorugh A Large File And Report Matches In Vcf File Under Conditions
6
1
Entering edit mode
11.4 years ago
stolarek.ir ▴ 700

I was trying to scan with positions from one file through positions in second file to find, if the features are overlaping between them. file a looks like: (typical vcf entries. Many of them)

chr1    1161692 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63
chr1    249174066 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63
chr1    249175897 chr1uGROUPERuDELu0u832  TGCTCTTTCCAGAAACCCTCAACCCTGTACGGTCAGGAGGAAACATGGCACCTCCCCTCTGGG T   63

I have a file Pt looking like this(tab delimited):

chr1    249174065 249174067
chr1    249175897    249175899

I wrote:

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break
            else:
                break

There is a problem now.

These both entries in Pt file should be a match. But the script only reports the first ontry from Pt file. If the first entry is not matched, the output is none. I want the script to output all matches

python vcf • 7.2k views
ADD COMMENT
1
Entering edit mode

Do you just want to extract subsets of the vcf file that are within certain regions? You could just use vcf-query

ADD REPLY
0
Entering edit mode

yes I do. but number of these regions is quite high. Can i pass the file with these regions to the vcf-query ?

ADD REPLY
1
Entering edit mode
11.4 years ago

While an expressive and beautiful language, Python can be a bit slow, especially at any work that involves I/O. A faster way to do this might be to convert to BED and use a dedicated set operation binary to quickly search for overlaps, e.g. given the file MySnps.vcf and the sorted BED file Pt.bed:

$ vcf2bed < MySnps.vcf \
    | bedops --element-of -1 - Pt.bed \
    > answer.bed

The file answer.bed will contain SNPs from the file MySnps.vcf that overlap regions in Pt.bed by one or more bases. (This overlap criterium can be adjusted, if needed.)

ADD COMMENT
0
Entering edit mode

this is a useful tools. It's always nice to practice code, but if I need a solution fast, that will probably be the best option for future

ADD REPLY
0
Entering edit mode
11.4 years ago
KCC ★ 4.1k

The issue as far as I can tell is the second " else: break: line.

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break
            else:
                break

Based on what you've written, the first time that the condition (line[0] == row[0]) fails, your script stops going through the elements of masterlist. I think you want to change you script to this:

 for line in a:
        line = line.strip().split()
        for row in masterlist:
            row = row.strip().split()
            w=[]
            if (line[0] == row[0]):
                f = range(int(row[1]),int(row[2]))
                w.append(line[1])
                for i in w:
                    i = int(i)
                    if i in f:
                        print line
                    else:
                        break

The second major issue is how you look to see if one interval is inside of another. You seem to be checking if the first coordinate of line, line[1] is in a list created from the coordinates of row, range(int(row[1]),int(row[2])) . It's quite inefficient to do it this way. Also, from what I see here w can only ever have one element, so why iterate over the elements of w. You should use tests to see if line[1] is more than or equal to row[1] but less than row[2].

Replace:

        w=[]
        if (line[0] == row[0]):
            f = range(int(row[1]),int(row[2]))
            w.append(line[1])
            for i in w:
                i = int(i)
                if i in f:
                    print line
                else:
                    break

with: if ((line[0] == row[0]) & (int(row[1]) <= line[1]) & (line[1] < int(row[2])) ): print line

ADD COMMENT
0
Entering edit mode

is there a way to optimize this script? It takes about 40 minutes to complete only for 60 entries in vcf file. And I wanted to use it for multiple 20k entries files. The filtering is done against Pt file which is 3668852 positions

ADD REPLY
0
Entering edit mode
11.4 years ago
stolarek.ir ▴ 700
masterlist = [row for row in Pt]

for line in a:
    line = line.strip().split()
    for row in masterlist:
        row = row.strip().split()
        b = int(line[1])
        f = range(int(row[1]),int(row[2]))
        if (line[0] == row[0]):
            if b in f:
                print line

The code above is the solution (turned out that i wasted couple of hours on optimizing the code, when the problem was with wrong file name [used a - apythonvcf, instead of a - abcpythonvcf]. But still, is there a way to make it go faster? Files are really huge.

Remember all, never give stupidly similar names to files

ADD COMMENT
1
Entering edit mode

Try this:

masterlist = [row for row in Pt]
for line in a:
  line = line.split() # No need to strip it. You are not doing any comparison on the last element.
  b = int(line[1]) # Moved this to the outer loop. There is no need to repeat this operation!
  for row in masterlist:
    row = row.strip().split()
    if (line[0] == row[0]):
      # There is no need to generate the ranges. Just do a comparison to figure out if b is in between
      if int(line[1]) <= b and b <= int(line[2]) 
        print line

That should be a bit faster.

ADD REPLY
0
Entering edit mode
11.4 years ago
always_learning ★ 1.1k

You can used data structue here !! In similar cases I have used "Interval tree" data structure but in Perl. But I have seen similar implementation on Python also.

But here one thing you try also is make a "Hash of array" if its possible in Python. As I am not much aware about that.

ADD COMMENT
0
Entering edit mode
11.4 years ago

use the tabix bindings for python https://github.com/samtools/tabix/tree/master/python

ADD COMMENT
0
Entering edit mode

do you have some example that how you use tabix bindings for python for any such operations ?

ADD REPLY
0
Entering edit mode

I'm not a python guy but there is an example: https://github.com/samtools/tabix/blob/master/python/test.py

ADD REPLY
0
Entering edit mode
11.4 years ago
stolarek.ir ▴ 700

I will try using your suggestions and see if they are quicker than mine. For now I reduced the search space in the second file, which fasten the script much.

def round_down(num):
    return num - (num%100000)

vcf = open('/home/istolarek/HCC218vcf_intersect/a.vcf','r')
#vcf = open('/home/ptedder/chr3.vcf','r')
Pt = open('/home/istolarek/moleculoNA12878/nonPt','r')
#Pt = open('/home/ptedder/chr3.txt','r')
out = open('/home/istolarek/HCC218vcf_intersect/aPTout','w')
#platinum_region = [row for row in Pt]
platinum_region={}
platinum_region['chrM']={}
platinum_region['chrM'][0]=[]
ct=0
for region in Pt:
    (chr,start,end)= region.strip().split()
    start=int(start)
    end=int(end)
    rounded_start=round_down(start)

    if not (chr in platinum_region):
        platinum_region[chr]={}
    if not (rounded_start in platinum_region[chr]):
        platinum_region[chr][rounded_start]=[]
    platinum_region[chr][rounded_start].append({'start':start,'end':end})

c=0
for vcf_line in vcf:
    if (c % 1000 ==0):print "c ",c
    c=c+1
    vcf_data = vcf_line.strip().split()
    vcf_chrom=vcf_data[0]
    vcf_pos=int(vcf_data[1])
    rounded_vcf_position=round_down(vcf_pos)
    #print "vcfchrom ",vcf_chrom, " line ", vcf_line
    if vcf_chrom in platinum_region and rounded_vcf_position in platinum_region[vcf_chrom]:
        for region in platinum_region[vcf_chrom][rounded_vcf_position]:
            if vcf_pos in range(region['start'],region['end']):
##                print vcf_line
                out.write(str(vcf_line)+'\n')
out.close()
ADD COMMENT

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6