Hello,
I have a huge file of 0s and 1s: a position where there is a SNP is coded 1 and all others are 0. The same TSV file contains several populations. My goal is to find signature SNPs present in a particular population (say population 1 is from column 1 to column 5). So I need the positions where 1 is present in all 5 columns for population 1 and 0 in all the other columns.
POS pop1 pop1 pop1 pop1 pop1 pop2 pop2 pop2 pop2 pop2 pop3 pop3 pop3 pop3 pop3 pop4 pop4 pop4 pop4 pop4
746 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
762 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0
So my result for the signature SNP position would be 746 for pop1 and 762 for pop3. Are there any tools to do this? I tried using awk for this, but it gave an error. My awk command is as follows:
$ awk -F '\t' '$1,$2,$3,$4.$5 ~ /1/' file.tsv
Is there a way to specify a range of columns in an awk pattern match?
I hope my question is clear. Any help would be appreciated. Thanks in advance.
Is the number of columns per population the same for all populations, or does it vary? Do you know this number before starting?
They are not equal, but I know the number before starting.
You could start by summing up the number of 1s per group and then filter for rows where the sum of 1s is identical to the number of elements in that group; for pop1, for example, take only rows where the sum is 5. I strongly suggest solving these kinds of things yourself: you will see it enhances your skills and enables you to apply the knowledge to other problems. You'll probably encounter problems like this repeatedly during your career.