I have a large PED file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz) and a large MAP file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz) that I am trying to work with.
My slightly modified MAP file has more than 600,000 rows, for example:
1 rs3094315 0 742429
1 rs7419119 0 831876
1 rs13302957 0 880884
1 rs6696609 0 893289
1 rs8997 0 939517
1 rs9442372 0 1008567
1 rs147606383 0 1035194
1 rs4970405 0 1038818
1 rs11807848 0 1051029
1 rs4970421 0 1098500
1 rs1320571 0 1110294
1 rs2887286 0 1145994
1 rs79118541 0 1147410
1 rs3813199 0 1148140
1 rs113791678 0 1151643
1 rs78424188 0 1160450
1 rs12073590 0 1195018
1 rs6685064 0 1201155
1 rs61559999 0 1225655
1 rs60785581 0 1225708
My slightly modified PED file has 942 rows. Each row is an individual, and after the six leading columns, each column is a genotype that corresponds to the matching row of the MAP file:
HGDP00001 HGDP00001 0 0 1 0 AG GT AA CC GG AG GG AA
HGDP00003 HGDP00003 0 0 1 0 AA GT AA TT GG GG GG AA
HGDP00005 HGDP00005 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00007 HGDP00007 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00011 HGDP00011 0 0 1 0 AG GT AA CT GG AG GG AA
HGDP00013 HGDP00013 0 0 1 0 AG TT AA CC AG AG GG AA
HGDP00015 HGDP00015 0 0 1 0 AG GT AA CT AG GG GG AA
HGDP00017 HGDP00017 0 0 1 0 AG GT AA CC GG GG GG AA
HGDP00019 HGDP00019 0 0 1 0 AA TT AG CT GG GG GG AA
HGDP00021 HGDP00021 0 0 1 0 AA GT AA TT GG AA GG AA
HGDP00023 HGDP00023 0 0 1 0 AA GT AA TT GG AG GG AA
HGDP00025 HGDP00025 0 0 1 0 AA GT AA CT GG AG GG AA
HGDP00027 HGDP00027 0 0 1 0 AG GT AA CT AG AG GG AA
HGDP00029 HGDP00029 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00031 HGDP00031 0 0 1 0 AG GT AG CT GG GG GG AA
HGDP00033 HGDP00033 0 0 1 0 AA TT AG CC GG AA GG AG
HGDP00035 HGDP00035 0 0 1 0 AG TT AG CT GG AG GG AA
HGDP00037 HGDP00037 0 0 1 0 AG GT AA TT AG GG GG AG
I am trying to get them into a single table formatted like some Affy array data I have (rows are SNP IDs, columns are individuals).
I was wondering if anyone could help me figure out a Python or Bash solution to transpose the PED file so that the 1st row of the PED file becomes the 5th column of the MAP file, the 2nd row of the PED file becomes the 6th column of the MAP file, and so on.
Basically, I want it to look like this (I presume that taking out the 0/1 rows and the location columns afterwards will be fairly simple?):
HGDP00001 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033 HGDP00035 HGDP00037 HGDP00039
HGDP00001 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033 HGDP00035 HGDP00037 HGDP00039
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 rs3094315 0 742429 AG AA AA AA AG AG AG AG AA AA AA AA AG AA AG AA AG AG AA
1 rs7419119 0 831876 GT GT TT TT GT TT GT GT TT GT GT GT GT TT GT TT TT GT TT
1 rs13302957 0 880884 AA AA AA AA AA AA AA AA AG AA AA AA AA AA AG AG AG AA AA
1 rs6696609 0 893289 CC TT CC CC CT CC CT CC CT TT TT CT CT CC CT CC CT TT CT
1 rs8997 0 939517 GG GG GG GG GG AG AG GG GG GG GG GG AG GG GG GG GG AG GG
1 rs9442372 0 1008567 AG GG GG GG AG AG GG GG GG AA AG AG AG GG GG AA AG GG GG
1 rs147606383 0 1035194 GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
1 rs4970405 0 1038818 AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AG AA AG AA
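For reference, the transposition described above could be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: both files are whitespace-delimited, each genotype is a single token such as `AG` (as in the modified PED shown here, not the standard two-allele-columns PED layout), and everything fits in memory. The names `merge_map_ped` and `n_meta` are made up for illustration.

```python
def merge_map_ped(map_lines, ped_lines, n_meta=6):
    """Transpose PED rows and attach them to the MAP rows.

    Returns a list of output lines: first the n_meta leading PED columns
    (IDs, parents, sex, phenotype) as rows, then one line per SNP made of
    the MAP fields followed by every individual's genotype.
    """
    map_rows = [line.split() for line in map_lines if line.strip()]
    ped_rows = [line.split() for line in ped_lines if line.strip()]

    # Every PED row must carry exactly one genotype token per MAP row.
    expected = n_meta + len(map_rows)
    for row in ped_rows:
        if len(row) != expected:
            raise ValueError(
                f"{row[0]}: expected {expected} fields, found {len(row)}"
            )

    # zip(*rows) turns the list of rows into a list of columns.
    columns = list(zip(*ped_rows))

    out = [" ".join(col) for col in columns[:n_meta]]
    for snp, genotypes in zip(map_rows, columns[n_meta:]):
        out.append(" ".join(snp + list(genotypes)))
    return out


# Tiny example built from the first two SNPs and individuals shown above.
example_map = ["1 rs3094315 0 742429", "1 rs7419119 0 831876"]
example_ped = [
    "HGDP00001 HGDP00001 0 0 1 0 AG GT",
    "HGDP00003 HGDP00003 0 0 1 0 AA GT",
]
example_out = merge_map_ped(example_map, example_ped)
```

For the real files, the same function could be fed open file handles (for instance from `gzip.open(path, "rt")`), though holding 942 x 600,000 genotype strings in memory at once may be heavy, so processing the SNPs in chunks might be needed.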
R would make this super easy, no? Does the solution *have* to be in bash/python?
It doesn't have to be; I just have more experience with Bash/Python. I do have some background in R, though.