Question

Conditionally replacing columns where both rows are 0 with 9

7.3 years ago

Ana ▴ 200

I have a directory containing nearly 11 million small SNP files, named like this:

wa_filtering_DP15_good_pops_snps_file_1
wa_filtering_DP15_good_pops_snps_file_2
.
.
.
wa_filtering_DP15_good_pops_snps_file_11232111

Each file has only 2 rows (the first row holds the allele counts for the wild allele, the second row the allele counts for the alternative allele) and 315 columns. It looks like this:

1   0   0   0   0   0   0   0   0   0   1   2   1   
0   0   0   0   0   0   0   0   0   0   0   0   0

I want to go through each file and, for every column where both rows are 0, replace both values with 9, to get something like this:

1   9   9   9   9   9   9   9   9   9   1   2   1   
0   9   9   9   9   9   9   9   9   9   0   0   0

Can someone help me figure out how to do that? Thanks.

bash text-processing • 1.6k views

score 2 · Accepted Answer · 2017-09-20

7.3 years ago

Pierre Lindenbaum 164k

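find . -type f -name "wa_filtering_DP15_good_pops_snps_fi*" ! -name "*.new" | while IFS= read -r F
do
    awk '
        NR==1 { split($0, a); next }    # first row: counts for the first allele
        NR==2 {
            split($0, b)                # second row: counts for the second allele
            # print each row, writing 9 wherever both rows hold 0
            for (i = 1; i <= NF; i++) printf("%s%s", (i==1 ? "" : "\t"), (a[i]==0 && b[i]==0) ? 9 : a[i]); printf("\n")
            for (i = 1; i <= NF; i++) printf("%s%s", (i==1 ? "" : "\t"), (a[i]==0 && b[i]==0) ? 9 : b[i]); printf("\n")
        }' "$F" > "$F.new"
done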

Thanks so much @Pierre Lindenbaum, your code worked as well

Reply • 7.3 years ago by Ana ▴ 200

score 2 · Accepted Answer · 2017-09-20

7.3 years ago

st.ph.n ★ 2.7k

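#!/usr/bin/env python
import sys

path = sys.argv[1]
# Progress indicator; sys.stdout.write behaves the same under Python 2 and 3.
sys.stdout.write('File Number: ' + path.split('_')[-1] + '\r')

# Read the two allele-count rows, splitting on any whitespace.
with open(path, 'r') as f:
    x = next(f).split()
    y = next(f).split()

# Recode a column as 9 in both rows when both counts are 0.
for n in range(len(x)):
    if x[n] == '0' and y[n] == '0':
        x[n] = '9'
        y[n] = '9'

with open(path + '.w9', 'w') as out:
    out.write('\t'.join(x) + '\n')
    # End the second row with a newline so outputs can be concatenated safely.
    out.write('\t'.join(y) + '\n')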

Output

1       9       9       9       9       9       9       9       9       9       1       2       1
0       9       9       9       9       9       9       9       9       9       0       0       0

Save as replace_w_9.py and run as: for file in wa_filtering_DP15_good_pops_snps_file_*; do python replace_w_9.py "$file"; done
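With millions of files, starting one Python process per file adds noticeable overhead. Below is a minimal sketch, assuming the same two-row layout and file-name prefix as in the question, that handles every matching file inside a single process. The names recode_pair and recode_directory are hypothetical, not part of the script above.

```python
import glob
import os


def recode_pair(x, y):
    """Return copies of two count rows with 0/0 columns recoded as '9'."""
    out_x, out_y = [], []
    for a, b in zip(x, y):
        if a == '0' and b == '0':
            a = b = '9'
        out_x.append(a)
        out_y.append(b)
    return out_x, out_y


def recode_directory(directory, prefix='wa_filtering_DP15_good_pops_snps_file_'):
    """Write a .w9 copy next to every input file; return how many were written."""
    done = 0
    pattern = os.path.join(glob.escape(directory), prefix + '*')
    for path in glob.iglob(pattern):
        if path.endswith('.w9'):
            continue  # skip outputs left by an earlier run
        with open(path) as f:
            x = next(f).split()
            y = next(f).split()
        x, y = recode_pair(x, y)
        with open(path + '.w9', 'w') as out:
            out.write('\t'.join(x) + '\n')
            out.write('\t'.join(y) + '\n')
        done += 1
    return done
```

Calling recode_directory('.') from the directory holding the files would then play the role of the shell loop.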

Thanks so much @st.ph.n, yes, it worked. I have an additional question. I actually have a single SNPs file in which the allele counts across populations for each SNP are represented by two lines, with the counts of allele one on the first line and the counts of the second allele on the second. The example I showed above is the allele count of the first SNP (lines 1 and 2). At first I thought I could split the file into one file per SNP and run your Python code on each of them, but I think that will be very complicated. Is there a way to run your Python code on the entire SNPs file, SNP by SNP, instead of splitting it into small per-SNP files? Thanks so much.

Reply • 7.3 years ago by Ana ▴ 200

OK, I found a solution for that! I just wrote this little bash script that splits the SNP file, runs your Python script on each piece, merges the results into a single file, and deletes the split files at the end:

#!/bin/bash

## directories
ROOT_DIR=/data/sh/H/lfmm/10K_random_SNPs_good_pop_LFMM_format/
FILE_DIR=${ROOT_DIR}/prep.lfmm.geno.file.test1

## locate files
INPUT=${FILE_DIR}/BayEnv_SNPSfile_random_1.tab.table
SCRIPT=${FILE_DIR}/replace.py 
OUTPUT=${FILE_DIR}/lfmm.part1

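## split the SNP file into 2-line pieces, one per SNP
split -l 2 -a 5 -d "${INPUT}" snp_batch_

for file in snp_batch_*
do
    python "${SCRIPT}" "$file"
done

## merge the recoded pieces, in order, into a single file
cat snp_batch_*.w9 >> "${OUTPUT}"

## remove the split pieces and their .w9 outputs
rm -f snp_batch_*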
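
The split, recode, and merge steps can probably also be done in one pass over the SNP file, since each SNP is just a consecutive pair of lines. This is a minimal sketch, assuming whitespace-separated counts and an even number of non-blank lines. The function name recode_snp_file is hypothetical, and the output is written tab-separated.

```python
def recode_snp_file(in_path, out_path):
    """Recode 0/0 columns as 9 for each consecutive pair of lines.

    Returns the number of SNPs (line pairs) written.
    """
    n_snps = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        # Ignore blank lines so a trailing empty line does not break pairing.
        lines = (line for line in src if line.strip())
        for first in lines:
            second = next(lines, None)
            if second is None:
                raise ValueError('odd number of lines in ' + in_path)
            x, y = first.split(), second.split()
            # Compare column by column within this SNP only.
            for i in range(min(len(x), len(y))):
                if x[i] == '0' and y[i] == '0':
                    x[i] = y[i] = '9'
            dst.write('\t'.join(x) + '\n')
            dst.write('\t'.join(y) + '\n')
            n_snps += 1
    return n_snps
```

For example, recode_snp_file('BayEnv_SNPSfile_random_1.tab.table', 'lfmm.part1') would read the input once and write the recoded pairs in their original order, with no temporary files to clean up.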