Gen file id reordering
9.4 years ago
maroisf ▴ 10

Hello Biostars,

I recently imputed a genotyped dataset using SHAPEIT and IMPUTE2. The resulting files are in the Oxford gen/sample format. Because I am using the latest 1000 Genomes reference dataset, I end up with massive gen files: chr is ~130 Gb. I would like to reorder the IDs in the produced gen/sample files. I looked in software such as GTOOL and QCTOOL for such an option but could not find one.

Is anyone aware of software that can do this, or do I have to code something myself?

Thank you

François

Imputation genome SNP • 2.6k views
9.4 years ago
biocyberman ▴ 870

Assuming the gene IDs are in the first column and the file is plain text, you can do this in Linux:

# extract the header line:
head -n 1 unsorted.file.txt > header.txt

# create the temporary directory used by sort, then test on the first 30 lines and see how it looks:
mkdir -p ./tmp
head -n 30 unsorted.file.txt | sed -e '1d' | sort --human-numeric-sort --temporary-directory=./tmp --parallel=10 --key=1 > sorted.test.tsv

# read more about the 'sort' command if necessary to modify the output:
man sort

# final sort command (skips the header line, sorts the rest):
sed -e '1d' unsorted.file.txt | sort --human-numeric-sort --temporary-directory=./tmp --parallel=10 --key=1 > sorted.file.tsv

# prepend the header line with sed (GNU sed):
sed -i -e "1i $(cat header.txt)" sorted.file.tsv

Thank you for your answer.

I'm sorry, I think I was not clear in my question.

Here is an explanation taken from the Oxford website on the gen file format:

Suppose you want to create a genotype for 2 individuals at 5 SNPs whose genotypes are

SNP 1 : AA AA
SNP 2 : GG GT
SNP 3 : CC CT
SNP 4 : CT CT
SNP 5 : AG GG

The correct gen file would be

SNP1 rs1 1000 A C 1 0 0 1 0 0
SNP2 rs2 2000 G T 1 0 0 0 1 0
SNP3 rs3 3000 C T 1 0 0 0 1 0
SNP4 rs4 4000 C T 0 1 0 0 1 0
SNP5 rs5 5000 A G 0 1 0 0 0 1

Along with the gen file comes a sample file:

ID_1 ID_2 missing cov_1 cov_2 cov_3 cov_4 pheno1 bin1
0 0 0 D D C C P B
1 1 0.007 1 2 0.0019 -0.008 1.233 1
2 2 0.009 1 2 0.0022 -0.001 6.234 0

The order of the IDs (individuals) in the sample file corresponds to the column order in the gen file, and each ID in the sample file is associated with 3 columns in the gen file. In this example, columns 6, 7 and 8 of the gen file correspond to ID 1, and columns 9, 10 and 11 correspond to ID 2.
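
To make that mapping concrete, here is a minimal sketch that looks up the three gen columns belonging to one ID. The function name gen_columns_for_id is made up for illustration; it assumes the sample file has two header lines with ID_1 in the first column, and that the gen file has five leading columns before the probabilities, as in the example above.

```shell
# Hypothetical helper: print the three 1-based gen-file column numbers
# that belong to one sample ID.
# Assumptions: the sample file has two header lines and ID_1 in column 1;
# the gen file has five leading columns (SNP id, rsid, position, two alleles).
# Usage: gen_columns_for_id SAMPLE_FILE ID
gen_columns_for_id() {
    awk -v id="$2" '
        NR > 2 && ($1 "") == (id "") {      # string comparison on ID_1
            k = NR - 2                      # 1-based position of the individual
            first = 5 + 3 * (k - 1) + 1     # first of its three probability columns
            print first, first + 1, first + 2
            found = 1
            exit
        }
        END { if (!found) exit 1 }          # non-zero status if the ID is absent
    ' "$1"
}
```

For the sample file shown above, asking for ID 2 should give columns 9, 10 and 11, matching the description.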

My problem is that I would like to change the order of the IDs in the sample file, and hence the column order in the gen file, keeping in mind that I have over 3000 IDs in my sample file and over 6 000 000 lines × 10 000 columns in my gen file.

Thank you
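
If no existing tool fits, one possible approach is a streaming awk pass, sketched below under some assumptions: the desired order is given in a separate file with one ID_1 value per line, the sample file has two header lines, and the gen file has five leading columns followed by three probability columns per individual. The function names and the order file are hypothetical, and this has not been tried on files of the size described above.

```shell
# Hypothetical sketch: reorder individuals in a .sample file and in the
# matching .gen file. The order file (an assumed input) lists one ID_1
# value per line, in the desired order.

# Usage: reorder_sample ORDER_FILE SAMPLE_FILE > new.sample
reorder_sample() {
    awk '
        NR == FNR { order[++n] = $1; next }   # first file: desired ID order
        FNR <= 2  { print; next }             # keep both sample header lines
        { row[$1] = $0 }                      # index data rows by ID_1
        END {
            for (i = 1; i <= n; i++) {
                if (!(order[i] in row)) {
                    print "unknown ID: " order[i] > "/dev/stderr"
                    exit 1
                }
                print row[order[i]]
            }
        }
    ' "$1" "$2"
}

# Usage: reorder_gen ORDER_FILE SAMPLE_FILE GEN_FILE > new.gen
# Reads the gen file line by line, so memory use does not grow with file size.
reorder_gen() {
    awk '
        FILENAME == ARGV[1] { order[++n] = $1; next }                 # desired order
        FILENAME == ARGV[2] { if (FNR > 2) pos[$1] = FNR - 2; next }  # current position of each ID
        FNR == 1 {                                                    # validate once, at the first gen line
            for (i = 1; i <= n; i++)
                if (!(order[i] in pos)) {
                    print "unknown ID: " order[i] > "/dev/stderr"
                    exit 1
                }
        }
        {
            printf "%s %s %s %s %s", $1, $2, $3, $4, $5               # five leading columns
            for (i = 1; i <= n; i++) {
                c = 5 + 3 * (pos[order[i]] - 1)                       # columns before this individual
                printf " %s %s %s", $(c + 1), $(c + 2), $(c + 3)
            }
            printf "\n"
        }
    ' "$1" "$2" "$3"
}
```

With hypothetical file names, this would be run as `reorder_sample new_order.txt chr.sample > chr.reordered.sample` and `reorder_gen new_order.txt chr.sample chr.gen > chr.reordered.gen`. IDs missing from the order file are dropped from the output, so the order file should list every individual that needs to be kept.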
