I recently imputed a genotyped dataset using SHAPEIT and IMPUTE2. The resulting files are in the Oxford gen/sample format. Because I am using the latest 1000 Genomes reference dataset, I end up with massive gen files: chr is ~130 Gb. I would like to reorder the IDs in the produced gen/sample files. I looked in software such as GTOOL and QCTOOL for such an option, but I cannot find any.
Is anyone aware of software that could do this, or do I have to code something myself?
Assuming the gene IDs are in the first column and the file is plain text, you can do this in Linux:
# extract the header line:
head -n 1 unsorted.file.txt > header.txt
# create the temporary directory that sort will use:
mkdir -p ./tmp
# test on the first 30 lines and see how it looks:
head -n 30 unsorted.file.txt | sed -e '1d' | sort --human-numeric-sort --temporary-directory=./tmp --parallel=10 --key=1 > sorted.file.tsv
# read more about the 'sort' command if necessary to modify the output:
man sort
# final sort command (skips the header line, overwrites the test output):
sed -e '1d' unsorted.file.txt | sort --human-numeric-sort --temporary-directory=./tmp --parallel=10 --key=1 > sorted.file.tsv
# prepend the header line with sed (GNU sed syntax):
sed -i -e "1i $(cat header.txt)" sorted.file.tsv
Here is an explanation taken from the oxford website on Gen file format:
Suppose you want to create a genotype for 2 individuals at 5 SNPs whose genotypes are
SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG
The correct gen file would be
SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1
Along with the Gen file is a SAMPLE file :
ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0
The order of the IDs (persons) in the sample file corresponds to the column order in the Gen file. Each ID in the sample file is associated with 3 columns in the Gen file. In this example, columns 6, 7 and 8 of the Gen file therefore correspond to ID 1, and columns 9, 10 and 11 correspond to ID 2.
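The layout described above (5 leading columns, then 3 columns per individual, with two header lines in the sample file) can be sanity-checked with a small sketch like the one below. The function name `check_gen_sample` and the file names are hypothetical, and it only inspects the first line of the Gen file.

```shell
# Sketch only: check that a .gen file has 5 + 3 * N columns,
# where N is the number of individuals in the .sample file.
# Usage: check_gen_sample data.gen data.sample   (hypothetical file names)
check_gen_sample() (
    # sample rows minus the two header lines (names and types)
    n_ind=$(( $(wc -l < "$2") - 2 ))
    # number of whitespace-separated columns on the first .gen line
    n_col=$(head -n 1 "$1" | awk '{ print NF }')
    expected=$(( 5 + 3 * n_ind ))
    echo "individuals: $n_ind, gen columns: $n_col, expected: $expected"
    # exit status 0 only if the two files agree
    [ "$n_col" -eq "$expected" ]
)
```

Because only the first line is read, this returns immediately even on a very large Gen file. It assumes the sample file ends with a newline, otherwise `wc -l` undercounts by one.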
My problem is that I would like to change the order of the IDs in the sample file, and hence the column order in the Gen file, keeping in mind that I have over 3000 IDs in my sample file and over 6,000,000 lines × 10,000 columns in my Gen file.
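One possible way to do this without loading the file into memory is to stream the Gen file through awk and rewrite each line with the 3-column blocks in the new order. The following is a minimal sketch, not a tested tool: the function names `reorder_sample` and `reorder_gen`, and the idea of a plain-text order file listing one ID_1 value per line in the desired order, are assumptions made for illustration.

```shell
# Sketch only. Assumed inputs (hypothetical names):
#   $1 = original .sample file (two header lines, then one row per individual)
#   $2 = order file: one ID_1 value per line, in the desired new order

# Usage: reorder_sample old.sample new_order.txt > new.sample
reorder_sample() {
    awk '
        # first file: print the two header lines, remember every other row by ID_1
        NR == FNR { if (FNR <= 2) print; else row[$1] = $0; next }
        NF == 0   { next }                      # skip blank lines in the order file
        !($1 in row) { print "unknown ID: " $1 > "/dev/stderr"; exit 1 }
        { print row[$1] }                       # emit rows in the requested order
    ' "$1" "$2"
}

# Usage: reorder_gen old.sample new_order.txt < old.gen > new.gen
reorder_gen() {
    awk -v sample="$1" -v order="$2" '
        BEGIN {
            # pos[id] = 1-based position of each individual in the original sample file
            while ((getline line < sample) > 0) {
                n++
                if (n > 2) { split(line, f, " "); pos[f[1]] = n - 2 }
            }
            # perm[i] = original position of the i-th individual in the new order
            while ((getline line < order) > 0) {
                if (line ~ /^[ \t]*$/) continue
                split(line, f, " ")
                if (!(f[1] in pos)) { print "unknown ID: " f[1] > "/dev/stderr"; exit 1 }
                perm[++m] = pos[f[1]]
            }
            if (m == 0) { print "no IDs read from order file" > "/dev/stderr"; exit 1 }
        }
        {
            # keep the 5 leading columns, then copy each 3-column block in the new order
            out = $1 " " $2 " " $3 " " $4 " " $5
            for (i = 1; i <= m; i++) {
                c = 5 + 3 * (perm[i] - 1)
                out = out " " $(c + 1) " " $(c + 2) " " $(c + 3)
            }
            print out
        }
    '
}
```

Both functions must be given the same order file so that the sample rows and the Gen columns stay in sync. With the two-individual example above and an order file containing `2` then `1`, the line for SNP2 would become `SNP2 rs2 2000 G T 0 1 0 1 0 0`. On a file of this size a single pass will still take a long time, since every line has to be read and rewritten once.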
Thank you
written 9.4 years ago by maroisf (updated 23 months ago by Ram)
Thank you for your answer.
I'm sorry, I think I was not clear in my question.