Hi all,
I have a tab-delimited text file of SNP data that I need to split into smaller files, with each file containing the SNPs that fall within a 20 Mb window. My problem is how to split the file conditional on the numerical value in one of the columns.
File format:
SNP ID Physical distance
rs_123132 12343
rs_123134 304354
rs_123434 8930044
I need a way to keep track of the distance between values in column 2 and, once it reaches >= 20,000,000, export all the rows within that block to a new file, repeating this for each block of 20,000,000 until the end of the file.
If possible I'd love to see this done in Python, as this is the language I am learning.
Thanks very much for any help!
Rubal
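A minimal sketch of one way to approach this in Python, under some assumptions that are not confirmed by the post: the file covers a single chromosome, rows are already sorted by position, and a block is measured from the first SNP it contains. The function names (split_into_windows, read_snps, write_blocks) and the "window" output prefix are hypothetical, chosen only for illustration.

```python
import csv

WINDOW = 20_000_000  # 20 Mb, the block size given in the question


def split_into_windows(rows, window=WINDOW):
    """Group (snp_id, position) pairs into consecutive blocks.

    A new block starts once a position lies `window` or more bases past
    the first position of the current block. Assumes rows are sorted by
    position.
    """
    blocks = []
    current = []
    block_start = None
    for snp_id, pos in rows:
        if block_start is not None and pos - block_start >= window:
            blocks.append(current)  # close the finished block
            current = []
            block_start = None
        if block_start is None:
            block_start = pos  # first SNP of a new block
        current.append((snp_id, pos))
    if current:
        blocks.append(current)  # last, possibly short, block
    return blocks


def read_snps(path):
    """Yield (snp_id, position) from a tab-delimited file.

    Lines whose second column is not a plain integer (for example a
    header line) are skipped.
    """
    with open(path, newline="") as handle:
        for fields in csv.reader(handle, delimiter="\t"):
            if len(fields) < 2 or not fields[1].strip().isdigit():
                continue
            yield fields[0], int(fields[1])


def write_blocks(blocks, prefix="window"):
    """Write each block to its own tab-delimited file: window_1.txt, ..."""
    for i, block in enumerate(blocks, start=1):
        with open(f"{prefix}_{i}.txt", "w", newline="") as out:
            csv.writer(out, delimiter="\t").writerows(block)


# Example use (the input file name is a placeholder):
# write_blocks(split_into_windows(read_snps("snps.txt")))
```

This version holds all blocks in memory before writing, which keeps the logic easy to follow; for a very large file the same loop could write each block as soon as it is closed. If fixed windows (0 to 20 Mb, 20 to 40 Mb, ...) are wanted instead, the block index could be computed as pos // WINDOW.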
Yes it's the physical rs location. Thanks! I'll try this out and let you know if it works for me.
I get the following error (perhaps I have compiled it incorrectly?):
Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: need more than 1 value to unpack
You have to call it with the name of your SNP file as the first argument. That error indicates that your SNP file is empty or that it is not tab-delimited. If it is not tab-delimited, use .split() in place of .split("\t").
Thanks for the feedback. Ironically, now that I have fixed that issue I seem to get the opposite message (thanks for your patience):
Traceback (most recent call last):
  File "makewindows.py", line 10, in <module>
    for rsid, start in file_iter:
ValueError: too many values to unpack
So your columns are separated by multiple spaces. Fix that, or use re.split(r"\s+", x) instead of x.split().
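Both unpacking errors in this thread come from a line yielding a number of fields other than exactly two. A small sketch of a more forgiving parser, assuming only the first two whitespace-separated fields matter; parse_line is a hypothetical helper, not part of makewindows.py:

```python
import re


def parse_line(line):
    """Return (rsid, position) from the first two fields, or None.

    Splits on any run of whitespace, so tabs and multiple spaces are
    both accepted.
    """
    fields = re.split(r"\s+", line.strip())
    if len(fields) < 2:
        # Blank or single-column line: unpacking this directly is what
        # raises "need more than 1 value to unpack".
        return None
    # Take only the first two fields; unpacking a longer list directly
    # is what raises "too many values to unpack".
    rsid, start = fields[:2]
    try:
        return rsid, int(start)
    except ValueError:
        return None  # non-numeric position, e.g. a header line
```

Lines for which parse_line returns None can be skipped or printed, which also shows which lines of the input are causing the trouble.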