Perl/Python script: phased vcf to phased tped
6.1 years ago
Shicheng Guo ★ 9.6k

Hi All,

Can anyone share a Perl/Python script to convert a phased VCF to a phased tped?

Thanks.


Update: plink re-orders the alleles, so phase is lost if plink is used anywhere in the data processing. Thanks for the explanation: the order in which the alleles appear in heterozygous genotype calls is usually determined by which allele is major/minor in the immediate dataset; this ordering will not vary between samples.

vcf ped phased • 4.0k views

Not answering your question but you have perhaps seen this.


This is the correct 'answer'. If you care about representing genotype phase in text, use VCF.


Yes. I think there should be some wheels outside.


Wheels? That word does not make sense in this context. Could you explain using different words, maybe?


Quote from PLINK docs:

The PED file is a white-space (space or tab) delimited file: the first six columns are mandatory:
     Family ID
     Individual ID
     Paternal ID
     Maternal ID
     Sex (1=male; 2=female; other=unknown)
     Phenotype

PED files do not hold genotype information, phased or not. Are you sure you're asking the right question?


PED files do hold genotypes, see https://www.cog-genomics.org/plink2/formats#ped


There are two prevalent PED formats - the one used/generated by plink has genotype information after the first six columns. The subset of this file with the first six columns alone is used in other tools, such as GATK's PhaseByTransmission, etc, and is the more prevalent one for clinical genetics usage. PLINK calls this format the .fam file.


Oops! Got confused with the .fam files. Thanks for the info!

6.1 years ago

plink 1.9's core only handles .bed files. So --vcf causes a temporary .bed file to be generated, which does not contain any phase information. When a ped/tped is then exported from the .bed, the order in which the alleles appear in heterozygous genotype calls is usually determined by which allele is major/minor in the immediate dataset; this ordering will not vary between samples, and has nothing to do with the original phase status.
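To make the point concrete, here is a toy Python sketch of what happens when genotypes pass through an unphased intermediate. It is not plink's code: the function names and the a1/a2 ordering are assumptions made for illustration, and only biallelic sites are handled.

```python
# Toy model only: this is NOT plink's code, just an illustration of why an
# unphased intermediate cannot give the original allele order back.

def to_unphased_code(gt):
    """Collapse a biallelic VCF GT string ('0|1', '1/0', './.') to an ALT-allele count."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing call
    return sum(int(a) for a in alleles)


def export_alleles(code, a1, a2):
    """Write a genotype code back out as an allele pair.

    a1/a2 are hypothetical names for whatever fixed allele order the
    exporter uses; here a1 is taken to be ALT and a2 to be REF.
    Every heterozygote gets the same (a1, a2) order.
    """
    if code is None:
        return ("0", "0")  # missing genotype
    if code == 0:
        return (a2, a2)    # homozygous REF
    if code == 2:
        return (a1, a1)    # homozygous ALT
    return (a1, a2)        # heterozygous, order fixed by the exporter


for gt in ("0|1", "1|0"):
    code = to_unphased_code(gt)
    print(gt, "->", code, "->", export_alleles(code, a1="T", a2="C"))
```

Both `0|1` and `1|0` collapse to the same code and come back out as the same pair, so the exported order says nothing about the original haplotypes.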


Moved this comment to an answer to make it clearer that there is incorrect advice in other answers.


Great, thanks Chris. Now I see: plink changes the allele order so that every individual is coded consistently by minor/major allele. Are you sure that vcftools --tped behaves the same way? Thanks.


It's basically irrelevant what vcftools --tped does, because phase is undefined in the tped format. You're effectively inventing your own file format and can't count on any software support from anyone else; much better to just write software that understands VCF, if you have to deal with text.

Meanwhile, please edit your top level answer to make it absolutely clear that it was incorrect.
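As a rough illustration of reading phase straight from VCF text, the sketch below pulls the two haplotype alleles out of each phased GT field using only the Python standard library. The function name `iter_haplotypes` is hypothetical, and it assumes an uncompressed, tab-delimited VCF with diploid calls; anything unphased or missing is skipped.

```python
def iter_haplotypes(vcf_lines):
    """Yield (chrom, pos, sample, hap1_allele, hap2_allele) for each phased diploid call."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        if len(fields) < 10:
            continue  # record has no genotype columns
        alleles = [fields[3]] + fields[4].split(",")  # REF is index 0, ALTs follow
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_idx = keys.index("GT")
        for sample, column in zip(samples, fields[9:]):
            parts = column.split(":")[gt_idx].split("|")
            if len(parts) != 2 or not all(p.isdigit() for p in parts):
                continue  # unphased ('/'), missing ('.') or non-diploid call
            yield (fields[0], int(fields[1]), sample,
                   alleles[int(parts[0])], alleles[int(parts[1])])


# Tiny in-memory example; a real file would be passed as an open file handle.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t100\trs1\tC\tT\t.\tPASS\t.\tGT\t0|1\t1|0",
]
for row in iter_haplotypes(example):
    print(row)
```

Because the allele order comes directly from the `|`-separated GT field, no intermediate format has to carry the phase.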


Hi Chris, I think I will keep my post, as I believe it is correct. I hope you can give further suggestions. I tested it on 1000 Genomes data, running diff chr22.vcf.vcf.tped chr22.vcf.pl.tped on the whole of chr22, and the two files are identical.


Could it be that just your example works by coincidence, but that the implementation (which chrchang523 obviously knows better than anyone else) does not guarantee phase information is preserved?


It should not be a coincidence: across the whole of chr22, the allele order (phase status) in the tped matches the VCF. Let's wait for chrchang523's further comments. We will get there soon.


Part 2 is the one that matters, and I have already explained why that can't possibly work and your test must be flawed. plink is open source, and it is straightforward to verify that (i) .bed does not store phase info and (ii) the implementation of --recode only uses (temporary) .bed as input.

If you do not edit your answer within 24 hours, I will delete it.


Okay. I respect your suggestion and have removed the plink part, keeping only the 'vcftools --tped' part.


Okay, maybe we can delete the whole post.


If you aren't going to delete the post, you need to explicitly mention that the plink test failed, after debugging your test if need be. It's the vcftools result that is meaningless, and can be deleted with no loss to anyone, since .tped is a plink file format; that's why the vcftools flag is called --plink-tped.

6.1 years ago
Shicheng Guo ★ 9.6k

Done. Just sharing with you all: I ran a test on 1000 Genomes chr22.

  1. Convert the phased VCF to tped with vcftools. Does the tped keep the phase status? Yes, it does.

    vcftools --vcf test.vcf --plink-tped --out out

  2. Use plink to create the tped. This failed: plink re-orders the alleles.

    plink --vcf test.vcf --recode transpose --out out
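One way to check a test like this is to compare the allele order at phased heterozygous calls directly, rather than diffing two tped files. The sketch below is a hypothetical helper (`count_het_order_mismatches` is not part of any tool) and assumes the VCF and tped list the same variants and samples in the same order, with two allele columns per sample after the first four tped columns.

```python
def count_het_order_mismatches(vcf_lines, tped_lines):
    """Return (checked, mismatched) over phased heterozygous calls.

    Assumption: the n-th VCF record and the n-th tped line describe the
    same variant, and samples appear in the same order in both files.
    """
    records = [l for l in vcf_lines if l.strip() and not l.startswith("#")]
    tped = [l for l in tped_lines if l.strip()]
    checked = mismatched = 0
    for vcf_line, tped_line in zip(records, tped):
        v = vcf_line.rstrip("\n").split("\t")
        calls = tped_line.split()[4:]  # chrom, id, cM, bp, then allele pairs
        alleles = [v[3]] + v[4].split(",")  # REF is index 0, ALTs follow
        gt_idx = v[8].split(":").index("GT")
        for k, column in enumerate(v[9:]):
            parts = column.split(":")[gt_idx].split("|")
            if len(parts) != 2 or not all(p.isdigit() for p in parts):
                continue  # unphased, missing or non-diploid
            if parts[0] == parts[1]:
                continue  # homozygous calls carry no order information
            expected = (alleles[int(parts[0])], alleles[int(parts[1])])
            observed = (calls[2 * k], calls[2 * k + 1])
            checked += 1
            if observed != expected:
                mismatched += 1
    return checked, mismatched


vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t100\trs1\tC\tT\t.\tPASS\t.\tGT\t0|1\t1|0",
]
print(count_het_order_mismatches(vcf, ["22 rs1 0 100 C T T C"]))  # order kept
print(count_het_order_mismatches(vcf, ["22 rs1 0 100 T C T C"]))  # fixed order
```

With real data this would be called on two open file handles. Even a zero mismatch count on one dataset would only describe that file, since the tped format itself does not define phase.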


This is incorrect, and you should mark it as such.


Hi Chris, can you give us some details on how you coded plink's VCF-to-tped conversion? At least in my small test dataset, the phase status was kept, so it would be great to hear more about how plink handles this. Thanks.

Let's take tped as the example, since it will be easy to show.


Did it work?

I thought plink could take VCF as input via --vcf or --bcf?


This doesn't work; Shicheng's test was faulty.


Yes, I tested it and it works: plink can take --vcf and --bcf as input. But I want to keep the phase status and do some further analysis in R, which I hope can take a 'phased ped' as input. As Chris said, any file created by plink will have lost the phase status.


I have cleaned up this thread. It is good that everyone can share their opinion here, but I hope we can start fresh from now.
