Question

How Can I Divide SNP Data Into Fixed Windows Based On Physical Distance?

13.2 years ago

Rubal ▴ 350

Hi all,

I have a tab-delimited text file of SNP data that I need to split into smaller files, with each file containing the SNPs that fall within one 20 Mb window. My problem is how to split the file conditional on the numerical value in one of the columns.

File format:

SNP ID     Physical distance
rs_123132  12343 
rs_123134  304354
rs_123434  8930044

I need a way to keep track of the distance between values in column 2 and when it becomes >= 20,000,000 to export all the rows within this block into a new file, and to do this for each block of 20,000,000 until the end of the file.

If possible I'd love to see this done in Python, as this is the language I am learning.

Thanks very much for any help!

Rubal

snp python • 2.5k views

ADD COMMENT • link updated 13.2 years ago by brentp 24k • written 13.2 years ago by Rubal ▴ 350

score 1 · Answer 1 · 2011-09-22

13.2 years ago

brentp 24k

I'm a bit confused as to whether your 2nd column is the rs location or the distance. Below, I assume it's the location, and that you want all SNPs with location < 20 million in one file, then SNPs between 20 and 40 million in another, and so on. (I ignore chromosome, since you seem to have done so also.)

import sys

# Lazily split each line of the input file on tabs.
file_iter = (x.strip().split("\t") for x in open(sys.argv[1]))
next(file_iter)  # drop header

files = {}        # window index -> open output file handle
SPLIT = 20000000  # window size in base pairs

for rsid, start in file_iter:
    # n is the index of the window that contains this position.
    n = int(start) // SPLIT
    if n not in files:
        files[n] = open('snps.%i.txt' % (n * SPLIT), 'w')
    print("%s\t%s" % (rsid, start), file=files[n])

for fh in files.values():
    fh.close()

Call this like:

python splitter.py your-snps.txt

and it will create files like: snps.0.txt, snps.20000000.txt, etc.
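
To make the window arithmetic concrete, here is a minimal sketch of how a position maps to an output file name. The helper name window_filename is hypothetical (it is not part of the script above), and the last two positions are made up for illustration; the first three come from the question.

```python
SPLIT = 20000000  # window size in base pairs, same value as in the script above

def window_filename(position, split=SPLIT):
    """Return the output file name for a SNP at `position` (hypothetical helper)."""
    # Integer division gives the window index; multiplying back gives the window start.
    window_start = (position // split) * split
    return "snps.%i.txt" % window_start

for pos in (12343, 304354, 8930044, 20000000, 45000000):
    print(pos, window_filename(pos))
```

Positions below 20,000,000 all land in snps.0.txt, and a position of exactly 20,000,000 starts the next window.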

ADD COMMENT • link 13.2 years ago by brentp 24k

Yes it's the physical rs location. Thanks! I'll try this out and let you know if it works for me.

ADD REPLY • link 13.2 years ago by Rubal ▴ 350

I get the following error (perhaps I have compiled it incorrectly?):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: need more than 1 value to unpack

ADD REPLY • link 13.2 years ago by Rubal ▴ 350

You have to call it with the name of your SNP file as the first argument. That error indicates that your SNP file is empty or is not tab-delimited. If it is not tab-delimited, use .split() in place of .split("\t").
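
A small sketch of why the unpacking fails, assuming a row is separated by spaces rather than a tab (the sample rows are taken from the question; the variable names are only for illustration):

```python
tab_line = "rs_123132\t12343\n"      # tab-delimited row
space_line = "rs_123132  12343 \n"   # same row, but separated by spaces

# Splitting on "\t" only works when a tab is actually present.
print(tab_line.strip().split("\t"))    # ['rs_123132', '12343']
print(space_line.strip().split("\t"))  # ['rs_123132  12343'] -> one field, so "rsid, start" cannot unpack

# split() with no argument splits on any run of whitespace.
print(space_line.split())              # ['rs_123132', '12343']
```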

ADD REPLY • link 13.2 years ago by brentp 24k

Thanks for the feedback. Ironically, now that I have fixed that issue, I seem to get the opposite message (thanks for your patience):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: too many values to unpack

ADD REPLY • link 13.2 years ago by Rubal ▴ 350

So your columns are separated by multiple spaces. Fix that, or use re.split(r"\s+", x) instead of x.split().
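
If it is unclear which rows are causing the "too many values" error, one option is to parse each line defensively and skip anything that does not have exactly two fields. This is only a sketch: parse_snp_line is a hypothetical helper, and it assumes the second field is always an integer position (int() will still raise ValueError otherwise).

```python
import re

def parse_snp_line(line):
    """Return (rsid, position), or None if the line does not have exactly two fields."""
    # Split on any run of whitespace (tabs or one or more spaces).
    fields = re.split(r"\s+", line.strip())
    if len(fields) != 2:
        return None
    return fields[0], int(fields[1])

print(parse_snp_line("rs_123132  12343 \n"))            # ('rs_123132', 12343)
print(parse_snp_line("SNP ID     Physical distance"))   # None (four fields)
```

Printing the lines for which this returns None should show what the extra columns or stray spaces look like.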

ADD REPLY • link 13.2 years ago by brentp 24k