Convert Plink Ped Format Into Hapmap Format?
3
4
Entering edit mode
12.8 years ago

Is there a way to convert from ped PLINK format into HapMap genotype format? I've got PED PLINK files that I want to analyse with a program that requires HapMap genotype format (the examples below are from HapMap, so I am guessing it's doable to convert them):

From this example file ([...] means more of the same):

wget -qO- ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_phaseIII/plink_format/hapmap3_r2_b36_fwd.consensus.qc.poly.ped.bz2 | bunzip2 -c | head -n 1 | less

2427    NA19919 NA19908 NA19909 1       -9      C C     C C     C C     A A     G G     G G     C C     T T     G G   [...]   A A

To this file format ([...] means more of the same):

wget -qO- ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_phaseIII/hapmap_format/consensus/genotypes_chr10_ASW_phase3.2_consensus.b36_fwd.txt.gz | gunzip -c | head

center protLSID assayLSID panelLSID QCcode NA19625 NA19700 NA19701 NA19702 NA19703 NA19704 [...] NA20364
rs12255619 A/C chr10 88481 + ncbi_b36 bbs urn:lsid:bbs.hapmap.org:Protocol:Phase3_Draft1:1 urn:lsid:bbs.hapmap.org:Assay:Phase3_Draft1_rs12255619:1 urn:lsid:dcc.hapmap.org:Panel:US_African-30-trios:3 QC+ AA AA AC AA AA AA [...] AA

For more details on the HapMap format:

-HapMap file format:
The current release consists of text-table files only, with the following columns:

Col1: refSNP rs# identifier at the time of release (NB: it might merge 
  with another rs# in the future)
Col2: SNP alleles according to dbSNP
Col3: chromosome that SNP maps to 
Col4: chromosome position of SNP, in basepairs on reference sequence
Col5: strand of reference sequence that SNP maps to
Col6: version of reference sequence assembly (currently NCBI build36)
Col7: HapMap genotyping center that produced the genotypes
Col8: LSID for HapMap protocol used for genotyping
Col9: LSID for HapMap assay used for genotyping
Col10: LSID for panel of individuals genotyped
Col11: QC-code, currently 'QC+' for all entries (for future use)
Col12 and on: observed genotypes of samples, one per column, sample
     identifiers in column headers (Coriell catalog numbers, example:
      NA10847).
hapmap genotyping plink • 16k views
ADD COMMENT
0
Entering edit mode

Hi Pierre,

Have you ever try paraHaplo software? I would like to use it because of its capabality to multiprocessing with MPI. I have used your script converttpedto_hapmap.pl to convert my tped data file to hapmap format but I meet a problem.

When I'm trying to run the haplotypePhasing module with command "../../src/SNP/bin/haplotypePhasing.exe ../../data/test.tped.hapmap hblockdef.txt phasedout_test.txt 1 9331"

I have got such error "input file open error! FileName:phasedouttest.txt_output0"

Unfortunately I dont know what could be a real problem. Maybe it is paraHaplo problem or hapMap format generated by your scritp. So, I wondering if you have any experience with paraHaplo?

ADD REPLY
0
Entering edit mode

The files could also come from a vcf source:

~/src/tabix-0.2.6/bgzip -c test.100000.vcf > test.100000.vcf.gz ~/src/vcftools_0.1.8a/bin/vcftools --gzvcf test.100000.vcf.gz --plink-tped

Then converting the tped files into hapmap could be done with the answers in this biostar question.

ADD REPLY
5
Entering edit mode
12.8 years ago
Davy ▴ 210

Python and R will allow to transpose your data set. very easy in R

data <- t(data)

Not sure about python. Depends on the data structures your using but let's say an array, then

data = transpose(data)

Plink also gives you an option to transpose your ped files.

plink --file mydata --transpose --out mytransposeddata

That gives you snps in rows and individuals in columns.You can then just add in the SNP, chromosome, and bp columns from the map file that plink uses.

Again, if you are using R you can use biomaRt to look up the information of all the SNPs you have in your file. If you don't know R or biomaRt, UCSC genome browser would be a good bet. It can give you tons of infromation on a predefined list of SNPs, or even all SNPs within specified regions.

Some columns you will have to fill in yourself like the protocols and assays, but assuming that that wouldn't be an extensive list, it is still pretty easy in R.

Hope this helps somehow.

ADD COMMENT
5
Entering edit mode
12.8 years ago

Hello,

I have made a lot of file format conversion in the past between ped/HapMap/impute/beagle etc, and have a lot of code for this. I made scripts in the past that dealed with the transposition themselves but once plink became the standard tool for GWAS y delegate the transposition ped/tped to plink and then use ped or tped for the format conversion depending the orientation of the destination format.

For converting ped to HapMap I convert first to tped with plink:

plink --noweb --file hapmap3_r2_b36_fwd.consensus.qc.poly--100_inds --recode --transpose --out hapmap3_r2_b36_fwd.consensus.qc.poly--100_inds`

And then use my script convert_tped_to_hapmap.pl.

perl /soft_bio/PMG/perl/convert_tped_to_hapmap.pl --tped hapmap3_r2_b36_fwd.consensus.qc.poly--100_inds.tped --tfam hapmap3_r2_b36_fwd.consensus.qc.poly--100_inds.tfam --build=ncbi_36

I have made a gist where you can download the script


git://gist.github.com/2069211.git

exceute perldoc <path to the script> to see the documentation and how to download the files for testing it.

This script contains the basics of the conversion (I have removed some SNP QC checking and the strand qc that connects to my own databases.)

NOTE: For converting to HapMap I will suggest first to flip the SNPs in ped to forward strand, so you can put all SNPs in the HapMap with 'strand +', and also you need to decide how to create the individual name. Usually you will have unique names for your whole study so you don't need to concatenate FamilyID and IndividualID. When working on studies with unrelated individuals when all IID are unique, sometimes the FID is equal to the FID and if you concatenate them, the name in HapMap becomes a ugly tandem duplication. For this general conversion tool, I am concatenating FID and IID with '__' (double underscore).

ADD COMMENT
2
Entering edit mode
12.8 years ago

Ok, since nobody has answered this question I suppose that there is no specific for this. Here is how I would execute this task.

Use a key/value datastore (leveldb, berkeleydb, etc...), and fill the following tables.

once your tables are filled:

for(col-index:  the markers)
    for(indi-name: individuals)
         print get(indi-name).genotypes[col-index];
ADD COMMENT

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6