I have a Beagle phased output file and I want to compare the columns of the file with each other and return the number of matched elements. I would prefer to use shell scripting or awk. Here is the bash/awk script that I am trying to use:
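#!/bin/bash
# For every pair of columns 3-9, count the rows where the two columns match.
for i in 3 4 5 6 7 8 9
do
    for j in 3 4 5 6 7 8 9
    do
        # Pass the column numbers in with -v: inside double quotes the shell
        # expands $i and $j itself, so awk would only see e.g. "3 == 4".
        awk -v i="$i" -v j="$j" '$i == $j' phased.txt | wc -l
    done
done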
I have a file of size 147189828 and I want to compare every pair of columns and return the number of matched elements as an 828 x 828 matrix (a similarity matrix). This would be fairly easy in MATLAB, but it takes a long time to load huge files. I can compare two columns and return the number of matched elements with the following awk command:
awk '$3==$4' phased.txt | wc -l
but I would need some help to do it for the entire file.
A snippet of the data:
# sampleID HGDP00511 HGDP00511 HGDP00512 HGDP00512 HGDP00513 HGDP00513
M rs4124251 0 0 A G 0 A
M rs6650104 0 A C T 0 0
M rs12184279 0 0 G A T 0
..
..
Please always show a snippet of the data. I have no idea what a phased Beagle file is, but I can help you with the comparison.
Hi Sukhdeep,
Thanks for reaching out. I have posted a snippet of the sample data. Your help is much appreciated.
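Going by the posted snippet, one possible way to build the whole matrix is to read the file once in awk and count matches for every pair of columns as each row goes by, instead of rescanning the file for each pair. This is only a sketch under a few assumptions: the function name similarity_matrix is made up for illustration, the header is taken to be any line starting with "#", the genotype columns are assumed to start at column 3, and a 0 is counted as a match with another 0, exactly as awk '$3==$4' would do.

```shell
#!/bin/bash
# similarity_matrix FILE [FIRST_COL]   (hypothetical helper, not a standard tool)
# Prints an N x N matrix; entry (i, j) is the number of data rows in which
# column i and column j hold the same value.
# FIRST_COL defaults to 3, assuming columns 1-2 are the "M" flag and marker ID.
similarity_matrix() {
    awk -v first="${2:-3}" '
        /^#/ { next }                      # assumed header line: skip it
        {
            if (NF > ncol) ncol = NF       # remember the widest row seen
            # Only the upper triangle (j >= i) is counted; the matrix is symmetric.
            for (i = first; i <= NF; i++)
                for (j = i; j <= NF; j++)
                    if ($i == $j) count[i, j]++
        }
        END {
            for (i = first; i <= ncol; i++) {
                for (j = first; j <= ncol; j++) {
                    # Mirror the upper triangle to fill the lower one.
                    n = (i <= j) ? count[i, j] : count[j, i]
                    printf "%s%d", (j > first ? " " : ""), n
                }
                print ""
            }
        }
    ' "$1"
}
```

After sourcing the function, something like similarity_matrix phased.txt > matrix.txt would write the matrix to a file. The diagonal is just the number of data rows. With around 828 columns this still means a few hundred thousand comparisons per row, so it may run for a long time on the full file, but it reads the data only once rather than once per pair of columns.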