How Can I Divide SNP Data Into Fixed Windows Based On Physical Distance?
13.2 years ago
Rubal ▴ 350

Hi all,

I have a tab-delimited text file of SNP data that I need to split into smaller files, with each file containing the SNPs that fall within one 20 Mb window. My problem is how to split the file conditional on the numerical value in one of the columns.

File format:

SNP ID     Physical distance
rs_123132  12343 
rs_123134  304354
rs_123434  8930044

I need a way to keep track of the distance covered by the values in column 2 and, when it becomes >= 20,000,000, to export all the rows within that block into a new file, and to do this for each block of 20,000,000 until the end of the file.

If possible I'd love to see this done in Python, as this is the language I am learning.

Thanks very much for any help!

Rubal

snp python • 2.5k views
13.2 years ago
brentp 24k

I'm a bit confused as to whether your 2nd column is the rs location or the distance. Below, I assume it's the location, and that you want all SNPs with location < 20 million in one file, then SNPs between 20 and 40 million in another, and so on. (I ignore chromosome, since you seem to have done so as well.)

import sys

# Lazily split each line of the input file into its tab-separated fields.
file_iter = (x.strip().split("\t") for x in open(sys.argv[1]))
next(file_iter)  # drop header

files = {}  # window index -> open output file handle
SPLIT = 20000000  # window size in bp (20 Mb)

for rsid, start in file_iter:
    n = int(start) // SPLIT  # index of the window containing this SNP
    if n not in files:
        files[n] = open('snps.%i.txt' % (n * SPLIT), 'w')
    files[n].write("%s\t%s\n" % (rsid, start))

for fh in files.values():
    fh.close()

Call this like:

python splitter.py your-snps.txt

and it will create files like: snps.0.txt, snps.20000000.txt, etc.
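To see why the file names come out as multiples of 20,000,000, here is a minimal sketch of the window arithmetic the script relies on. The helper name `window_start` and the position 25000000 are made up for illustration; the other three positions are the ones from the question.

```python
SPLIT = 20000000  # window size in bp (20 Mb), same value as in the script


def window_start(position, size=SPLIT):
    """Return the start coordinate of the fixed window containing position.

    window_start is a hypothetical helper, not part of the original script.
    """
    return (int(position) // size) * size


# The three positions from the question all fall in the first window.
for pos in (12343, 304354, 8930044):
    print(pos, "-> snps.%i.txt" % window_start(pos))

# An illustrative (made-up) position that falls in the second window.
print(25000000, "-> snps.%i.txt" % window_start(25000000))
```

Because the window is derived from the position alone, the input does not need to be sorted for this approach to work.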


Yes it's the physical rs location. Thanks! I'll try this out and let you know if it works for me.


I get the following error (perhaps I have compiled it incorrectly?):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: need more than 1 value to unpack


You have to call it with the name of your SNP file as the first argument. That error indicates that your SNP file is empty or is not tab-delimited. If it is not tab-delimited, use .split() in place of .split("\t").


Thanks for the feedback. Ironically, now that I have fixed that issue I seem to get the opposite message (thanks for your patience):

Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: too many values to unpack


So your columns are separated by multiple spaces. Fix that, or use re.split(r"\s+", x) (after import re) instead of x.split().
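If it is not obvious which lines are causing the unpack errors, one option is to split every line on runs of whitespace and report the lines that do not yield exactly two fields. This is only a rough sketch, assuming a two-column file with an integer in the second column; `parse_snp_line` and `find_bad_lines` are hypothetical helper names, not part of the script above.

```python
import re


def parse_snp_line(line):
    """Split a line on runs of whitespace; return (rsid, position) or None.

    Hypothetical helper for illustration. Returns None when the line does
    not have exactly two fields or the second field is not an integer.
    """
    fields = re.split(r"\s+", line.strip())
    if len(fields) != 2:
        return None
    try:
        return fields[0], int(fields[1])
    except ValueError:
        return None


def find_bad_lines(lines):
    """Return (line_number, text) pairs for lines that fail to parse."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if parse_snp_line(line) is None:
            bad.append((number, line.rstrip("\n")))
    return bad
```

Calling `find_bad_lines(open("your-snps.txt"))` should list the header line plus any blank lines or lines with extra columns, which are the usual causes of both "need more than 1 value" and "too many values" errors.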
